Disclosure of Invention
In order to solve the technical problems in the prior art, the invention provides an entity information extraction method for power defect texts based on an improved Transformer encoder. On the basis of an original Transformer-based NER model, a pre-training language model is introduced to convert the text data into character and word vectors, and a dictionary obtained by word segmentation of a large amount of corpus is introduced so that word information is fused on top of character information. The network is updated in a graph manner, so that character, word and global information are better fused and the entity information of the power defect text can be extracted more accurately.
The invention can be achieved by adopting the following technical scheme:
the entity information extraction method of the power defect text based on the improved Transformer encoder comprises the following steps:
S1, introducing a defect record data text of secondary equipment of an electric power system, and labeling the data text;
S2, introducing a pre-training model, a dictionary, a fine-tuned TENER model and a conditional random field model, building a CWG-TENER model, and performing optimization training on the CWG-TENER model by using the labeled data text to obtain a power equipment defect text information extraction model;
and S3, inputting the power equipment defect text from which information is to be extracted into the power equipment defect text information extraction model to obtain the extracted information.
Specifically, the step S2 includes:
S21, introducing a pre-training model and a dictionary, and extracting character vectors of the data text and word vectors of the dictionary words, wherein the dictionary is obtained by word segmentation of a large amount of original corpus;
S22, forming a character vector set C from the extracted character vectors, matching the data text against the words in the dictionary, and forming a word vector set W from the word vectors corresponding to the matched words;
S23, building a character-word graph (CWG) model;
S24, replacing the CRF layer of the TENER model with a fully connected layer so that the output dimension is the same as the character and word vector dimension, thereby obtaining a fine-tuned TENER model;
S25, taking the character vector set C and the word vector set W as the input of the fine-tuned TENER model to obtain the initial value C_0 of the node feature vectors and the initial value W_0 of the edge feature vectors, replacing the nodes and edges of the CWG model with the initial value C_0 of the node feature vectors and the initial value W_0 of the edge feature vectors respectively, and defining the initial value of the CWG model global variable as g_0;
S26, aggregating the nodes of the CWG model, the edges of the CWG model and the global variable of the CWG model respectively, to obtain the character vectors, word vectors and global vector after the first aggregation;
S27, replacing the nodes of the CWG model, the edges of the CWG model and the global variable of the CWG model with the aggregated character vectors, word vectors and global vector;
S28, updating the character vectors and word vectors by the fine-tuned TENER model, and calculating the updated output of the global vector by the LSTM network state-update formula;
S29, replacing the nodes of the CWG model, the edges of the CWG model and the global variable of the CWG model with the updated character vectors, word vectors and global vector respectively, and aggregating the nodes of the CWG model, the edges of the CWG model and the global variable of the CWG model;
S210, cycling the steps S28 to S29 for T times to obtain a final character feature vector set;
S211, inputting the final character feature vector set into the conditional random field model CRF, and calculating the output optimal label sequence;
and S212, optimizing the model parameters by using an Adam optimizer according to the optimal label sequence, and training cyclically for a preset number of times to obtain the power equipment defect text information extraction model.
Compared with the prior art, the invention has the following advantages and beneficial effects:
The invention provides an entity information extraction method for power defect texts based on an improved Transformer encoder. A pre-training language model, a dictionary, a fine-tuned TENER model and a conditional random field model are introduced to build a CWG-TENER model, and the model is optimized, trained and tested with labeled power system secondary equipment defect texts to obtain a power equipment defect text information extraction model. The model can extract the entity information required from power system secondary equipment defect texts more effectively, which facilitates the subsequent construction of a knowledge graph and provides an auxiliary decision-making function when the secondary equipment of the power system fails.
Detailed Description
The technical solutions of the present invention will be described in further detail with reference to the drawings and examples, and it is obvious that the described examples are some examples of the present invention, but not all examples, and the embodiments of the present invention are not limited thereto. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
Example 1:
In this embodiment, aiming at the problem of extracting the "defect phenomenon" information from texts describing functional defects of secondary devices in a power system, a "character-word graph" model is constructed for the defect text, an improved Transformer-based encoder suitable for NER is used to aggregate and update the graph neural network, and finally a conditional random field model is used to output a labeling sequence for the text, from which the "defect phenomenon" information is extracted.
As shown in fig. 1, the embodiment provides a method for extracting entity information of a power defect text based on an improved Transformer encoder, which specifically includes the following steps:
S1, introducing the defect record data text of the secondary equipment of the power system, and labeling the data text; the labeling result of the data text is shown in Fig. 2.
Taking the extraction of the "defect phenomenon" phrase as an example, the first character of the phrase representing the defect phenomenon in the text is labeled "B", the remaining characters of the phrase are labeled "I", and the characters irrelevant to the defect phenomenon in the text are labeled "O".
Taking the text "The protection device is operating abnormally." in Fig. 2 as an example, the "defect phenomenon" phrase is "device operating abnormally"; the first character of "device" is labeled "B", the remaining characters of the phrase are labeled "I", and "protection" and the final period, which are unrelated to the "defect phenomenon", are labeled "O".
S2, introducing a pre-training model, a dictionary, a fine-tuned TENER model and a conditional random field model, building the CWG-TENER model, and optimizing and training the model with the data text labeled in S1 to obtain the power equipment defect text information extraction model.
S21: introducing a pre-training model and a dictionary, extracting character vectors of data texts and word vectors of dictionary words, wherein the dictionary is obtained based on a large number of original corpus participles, and the pre-training model is any one of the following models: BERT model, BERT-wwm model, ERNIE model.
BERT (Bidirectional Encoder Representations from Transformers) is a Transformer-based bidirectional encoder published by Google in 2018. It is the "first deep bidirectional, unsupervised language representation, pre-trained using only a plain text corpus". The pre-trained BERT model can be adapted to various natural language processing tasks with only the fine-tuning of an additional output layer.
BERT-wwm (Whole Word Masking) is an upgraded version of BERT released by Google in 2019 that mainly changes the training sample generation strategy of the original pre-training stage: the original WordPiece masking is replaced by Whole Word Masking. For Chinese, this means that if one character is masked, the other characters belonging to the same word are also masked.
ERNIE (Enhanced Representation through Knowledge Integration) is a BERT-based optimized model published in 2019. It mainly improves the masking mechanism, which consists of three types of masking: basic-level masking (word piece), phrase-level masking (WWM style) and entity-level masking.
BERT/BERT-wwm is trained on Wikipedia data and performs better on formal text; ERNIE additionally uses web data such as Baidu forum posts and Q&A data, which is advantageous for informal text (e.g., microblogs). If traditional Chinese data is to be processed, BERT or BERT-wwm should be used, since there is little traditional Chinese in the vocabulary of ERNIE.
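A minimal sketch of how character vectors could be extracted with a pre-trained Chinese BERT via the Hugging Face transformers library is given below; the checkpoint name "bert-base-chinese" and the use of transformers are illustrative assumptions, not requirements of the invention:

```python
import torch
from transformers import BertTokenizer, BertModel

# Illustrative only: a BERT-wwm or ERNIE checkpoint could be substituted here.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

text = "保护装置运行异常。"  # example sentence from Fig. 2 (back-translated, assumed)
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)

with torch.no_grad():
    outputs = model(**inputs)

# Approximately one d_model-dimensional vector per character: shape (m, d_model)
char_vectors = outputs.last_hidden_state.squeeze(0)
print(char_vectors.shape)
```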
S22: the extracted character vectors form a character vector set C, the data text is matched with words in a dictionary, and word vectors corresponding to the matched words form a word vector set W.
The character vector set C is:

C = [c_1, c_2, ..., c_m]

where c_1, c_2, ..., c_m are the character vectors extracted from the data text by the pre-training model, and m is the total number of characters in the text.

Taking the sequence "The protection device is operating abnormally." shown in Fig. 2 as an example, there are 9 characters in total, so m = 9; letting the dimension of the character vectors be d_model, the C obtained from this sequence is a d_model × m matrix.
The word vector set W is:

W = [w_{b_1,e_1}, w_{b_2,e_2}, ..., w_{b_n,e_n}]

where w_{b_i,e_i} is the word vector corresponding to the i-th matched word and has the same dimension as the character vectors, b_i and e_i are the head and tail characters of the word corresponding to the i-th word vector, and n is the total number of dictionary words matched by the data text.

The specific definition of a matched word is: if a word in the dictionary appears as a contiguous character sequence in the data text, it is a matched word.

Taking the sequence "The protection device is operating abnormally." shown in Fig. 2 as an example, there are 5 matched words, namely "protect", "device", "run", "abnormal" and "protection", so n = 5, and the W obtained from this sequence is a d_model × n matrix.
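A minimal sketch of the matching step, assuming the dictionary is a plain set of words and that a word matches when it appears as a contiguous substring of the text (the mini-dictionary below is an assumed back-translation of the Fig. 2 example):

```python
def match_words(text, dictionary):
    """Return (word, b, e) triples: dictionary words appearing contiguously in text,
    with b and e the (0-based) positions of their head and tail characters."""
    matches = []
    m = len(text)
    for b in range(m):
        for e in range(b, m):
            word = text[b:e + 1]
            if word in dictionary:
                matches.append((word, b, e))
    return matches

# Hypothetical mini-dictionary for the Fig. 2 sentence
dictionary = {"保护", "装置", "运行", "异常", "保护装置"}
print(match_words("保护装置运行异常。", dictionary))
# -> 5 matched words, i.e. n = 5
```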
S23: constructing a CWG (Character-Word Graph) model, wherein the CWG model has a specific structure of a directed Graph formed by data text information, and a Character vector c
i Nodes, word vectors, forming a graph
Form the slave character b
j Corresponding node pointing character e
j The edge of the corresponding node.
The CWG model constructed from the sequence of figure 2 is shown in figure 3.
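A minimal sketch of this graph structure as plain Python data, assuming the node and edge indices follow the (word, b, e) matches produced above; the representation is illustrative, not the invention's internal data format:

```python
# Nodes: one per character vector c_i; edges: one per matched word w_{b,e},
# directed from the head-character node b to the tail-character node e.
chars = list("保护装置运行异常。")                        # 9 nodes (assumed example text)
edges = [("保护", 0, 1), ("装置", 2, 3), ("运行", 4, 5),
         ("异常", 6, 7), ("保护装置", 0, 3)]              # 5 edges

cwg = {
    "nodes": list(range(len(chars))),                    # node i holds character vector c_i
    "edges": [(b, e) for _, b, e in edges],              # edge (b, e) holds word vector w_{b,e}
    "global": None,                                       # global variable g, initialised later
}
print(cwg)
```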
Then, a cyclic operation of "update → aggregation → update → ... → aggregation" is performed on the CWG model to extract text features, specifically: compute the initial values of the character vectors C (nodes), the word vectors W (edges) and the global vector g through an update → aggregate the character vectors C (nodes), the word vectors W (edges) and the global vector g → update the character vectors C (nodes), the word vectors W (edges) and the global vector g → ... → aggregate the character vectors C (nodes), the word vectors W (edges) and the global vector g. This process is described in detail below.
S24: a fine-tuning teer model is introduced to perform the "update" operation of the CWG model. The TENER model is a Transformer model improved based on a named entity recognition task, and the specific mode of fine tuning is as follows: and replacing the CRF layer of the model with a full connection layer to ensure that the output dimension is the same as the word and phrase vector dimension, thereby obtaining the fine tuning TENER model.
In this embodiment, when the nodes are updated in the subsequent steps, the output obtained by the attention mechanism is C_{t+1}'. The formula of the fully connected layer is:

C_{t+1} = U_Linear · C_{t+1}' + B_Linear

where U_Linear and B_Linear are trainable parameters of the fully connected layer. A round of updating is thereby completed, and the dimension of the resulting character vectors C_{t+1} is identical to that of C_t.
S25: taking the character vector set C and the word vector set W obtained in the S22 as the input of the fine tuning model TENER to obtain an output C 0 And W 0 That is, the character vector and the word vector obtained after the first round of "update" operation are used as the initial values of the feature vectors of the nodes and the edges to replace the CWG module of S23Nodes and edges of the pattern. Simultaneously defining the initial value of the global variable of the CWG model as g 0 =average(C,W);
S26: respectively aggregating the feature vector, the edge feature vector and the global variable of the character nodes of the CWG to obtain a character vector after the first aggregation
Word vector
And a global vector
The specific method comprises the following steps:
the aggregation formula of the nodes is as follows:
wherein i represents the ith character and t represents the tth round of updating.
Aggregating the feature vectors of the preceding character nodes for the t-th round,
for the aggregated character node feature vector,
is composed of
The feature vector of the predecessor node of (a),
is composed of
The incoming edge feature vector of (a) is,
representing the concatenation of two vectors, and the MultiAtt () representing the aggregation in a multi-headed attentive manner.
The aggregation formula of the edges is:
wherein,
the feature vector of the edge pointing from node b to node e before the t-th round of aggregation,
is the feature vector of the edge after the aggregation,
is equal to the edge w
b,e All characters corresponding to the word match correspond to a set of feature vectors.
The calculation formula of the global variable is as follows:
wherein,
for a set of feature vectors corresponding to all characters in the input text sequence,
forming a word vector set for the word vectors corresponding to all the matched words, g
t For the global vector before the t-th aggregation,
is a global vector after character vector information is merged in the t-th aggregation process,
is a global vector after word vector information is merged in the t-th round aggregation process,
and aggregating the obtained final global vector for the t round.
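The original aggregation formulas are given as images in the specification; the following PyTorch sketch only illustrates the node-aggregation idea under the assumption that a standard multi-head attention (standing in for MultiAtt) attends from the node vector to the concatenation of its predecessor-node and incoming-edge vectors. Shapes and module choices are illustrative assumptions:

```python
import torch
import torch.nn as nn

d_model, n_heads = 768, 8
multi_att = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

def aggregate_node(c_i, pred_nodes, in_edges):
    """Schematic node aggregation: fuse a character node with its predecessor
    nodes and incoming edge vectors via multi-head attention."""
    # Concatenate predecessor node vectors and incoming edge vectors along the sequence axis.
    context = torch.cat([pred_nodes, in_edges], dim=0).unsqueeze(0)  # (1, k, d_model)
    query = c_i.view(1, 1, d_model)                                  # (1, 1, d_model)
    out, _ = multi_att(query, context, context)                      # attend over the context
    return out.view(d_model)                                         # aggregated node vector

# Dummy example: a node with 2 predecessor nodes and 1 incoming edge
c_bar = aggregate_node(torch.randn(d_model),
                       torch.randn(2, d_model),
                       torch.randn(1, d_model))
print(c_bar.shape)
```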
S27: by character vector
Word vector
And a global vector
And replacing the feature vector, the edge feature vector and the global feature vector of the character node of the CWG model.
S28: and updating the character vectors and the word vectors by finely adjusting the TENER model, and calculating the updating output of the global vectors by an LSTM network state updating formula.
S281: performing the t-th round of aggregation on the feature vectors of the character nodes, and taking the aggregated output as the input of the fine-tuned TENER model to obtain new character vectors. Specifically, the input of the fine-tuned TENER model is constructed from the quantities described in step S26, where i denotes the i-th character and t denotes the t-th round of updating: the feature vectors of the character nodes before the t-th round of aggregation, the character node feature vectors after the t-th round of aggregation and the aggregated global vector of the t-th round together form the constructed input of the fine-tuned TENER model.

The constructed input is fed into the fine-tuned TENER model FTTENER_c() for updating to obtain the output C_{t+1}, where FTTENER_c() denotes the fine-tuned TENER model for the character vectors.
The fine-tuned TENER model contains an attention mechanism and position encodings, where the attention mechanism comprises a single-head attention mechanism and a multi-head attention mechanism. The position encoding of character c_i relative to character c_j is built from their relative position: i and j denote the i-th and the j-th character respectively, p_{i,j} is the relative position of c_i with respect to c_j, 2k and 2k+1 are the indices of the elements in the encoding vector, d_input is the input dimension of the FTTENER model, and R_{ij} is the final position encoding of c_i relative to c_j.
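The encoding formula appears as an image in the original specification; a minimal sketch, assuming the standard Transformer-style sinusoidal form suggested by the (2k)/(2k+1) element indices, is:

```python
import math
import torch

def relative_position_encoding(p_ij, d_input):
    """Sinusoidal encoding of the relative position p_ij = i - j (assumed form)."""
    r = torch.zeros(d_input)
    for k in range(d_input // 2):
        freq = p_ij / (10000 ** (2 * k / d_input))
        r[2 * k] = math.sin(freq)       # even elements: sine
        r[2 * k + 1] = math.cos(freq)   # odd elements: cosine
    return r

R_ij = relative_position_encoding(p_ij=3, d_input=768)
print(R_ij.shape)
```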
In this embodiment, when the single-head attention mechanism is used in the fine-tuned TENER model, the computation is as follows. The input is the constructed character representation, where d_model is the dimension of the character vectors and m is the total number of characters in the text. The input is projected into different spaces by three learnable matrices, and the attention output is computed from the resulting query, key and value:

A_{y,z} = Q_y · K_z^T

Attn(Q, K, V) = softmax(A) V

where Q is the query vector in the attention mechanism, K is the key vector in the attention mechanism, V is the value vector in the attention mechanism, Q_y is the query vector of the y-th text character, K_z is the key vector of the z-th text character, ^T denotes the transpose of a vector, z is the index of the character attended to by the y-th text character, and A_{y,z} is the attention value of the y-th text character for the z-th text character. The single-head attention output C_{t+1}' is then processed by the fully connected layer described in S24.

According to the above position encoding, when d_input = 5·d_model (d_model being the encoding dimension of the character vectors), the position encoding of c_i relative to c_j is obtained with this input dimension.
when a multi-head attention mechanism is used in the fine-tuning TENER model for improving the self-attention capacity, n groups of mapping matrixes are set
The output equation is as follows:
where n is the number of heads and the superscript h is the head index, i.e., Q
(h) 、K
(h) 、V
(h) Respectively as the query vector and the key vector value vector of the h head,
is a learnable matrix corresponding to the above-mentioned three vectors,
is the query vector at the h-th head of the y-th text character,
is the transpose of the h-th head of the z-th text character's key vector,
attention value, head, at h head for y text character to z text character
(h) Is the output of the h head in the multi-head attention mechanism.
To learn the parameters, an output C is obtained
t+1 ' is:
C t+1 ′=W o [head (1) ;...;head (n) ]
dimension d at this time
input =5d
model ,
Relative to
The position coding of (2) is the same as in the single head case.
S282: performing the t-th round of aggregation on the feature vectors of the edges, and taking the aggregated output as the input of the fine-tuned TENER model to obtain new word vectors. Specifically, the input of the fine-tuned TENER model is constructed from the quantities described in S26, where i denotes the i-th edge and t denotes the t-th round of updating: the feature vectors of the edges before the t-th round of aggregation, the edge feature vectors after the t-th round of aggregation and the aggregated global vector of the t-th round together form the constructed input.

The constructed input is fed into the fine-tuned TENER model FTTENER_w() for updating to obtain the output W_{t+1}, where FTTENER_w() denotes the fine-tuned TENER model for the word vectors.
The FTTENER model in S282 differs slightly from the model in S281 in the position encoding of its input. For words, four relative distances are used: the distance between the start character of the i-th word and the start character of the j-th word, the distance between the start character of the i-th word and the end character of the j-th word, and so on for the remaining start/end combinations. Each of these four distances pos is encoded into a relative position vector p_pos, where 2k is the index of an element in the vector and d_input is the input dimension of the FTTENER model.

In this embodiment, taking the graph of Fig. 3 as an example, the relative distances between matched words are computed from their start and end characters, and the remaining relative position encodings are obtained by analogy; d_model is the encoding dimension of the word vectors.

The four relative position vectors are combined through a linear layer with the trainable parameter U_r, so that the final R_ij has the same dimension as the input vector; in this embodiment, R_ij is the final position encoding of the i-th word vector relative to the j-th word vector.
S283: updating the global variable by the LSTM network state-update formula to obtain g_{t+1}. LSTM networks (long short-term memory networks) are commonly used for NLP problems; the update of the global vector follows the update of the state value in an LSTM, and the calculation formula is:

g_{t+1} = f_{t+1} ⊙ g_t + i_{t+1} ⊙ u_{t+1}

where U, V and b are trainable parameters; the gate equation is written once for both i and f (i.e., the first equation actually contains two equations, one for the input gate i_{t+1} and one for the forget gate f_{t+1}), and u_{t+1}, i_{t+1} and f_{t+1} are intermediate variables introduced for clarity of formulation.
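A minimal sketch of this LSTM-style state update, assuming the gates are computed from an aggregated character/word summary x_t and the previous global vector g_t with trainable parameters (the exact inputs to the gates are assumptions):

```python
import torch
import torch.nn as nn

d_model = 768

class GlobalUpdate(nn.Module):
    """LSTM-style update of the global vector g (schematic)."""
    def __init__(self, d):
        super().__init__()
        self.gates = nn.Linear(2 * d, 2 * d)      # produces input gate i and forget gate f
        self.candidate = nn.Linear(2 * d, d)      # produces candidate state u

    def forward(self, g_t, x_t):
        h = torch.cat([x_t, g_t], dim=-1)
        i_f = torch.sigmoid(self.gates(h))
        i_t1, f_t1 = i_f.chunk(2, dim=-1)
        u_t1 = torch.tanh(self.candidate(h))
        return f_t1 * g_t + i_t1 * u_t1           # g_{t+1} = f ⊙ g_t + i ⊙ u

update = GlobalUpdate(d_model)
g_next = update(torch.randn(d_model), torch.randn(d_model))
print(g_next.shape)
```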
S29: respectively replacing nodes, edges and global variables of the CWG model with the updated character vector, word vector and global vector, and aggregating the nodes, edges and global variables of the CWG model;
s210: and (5) circulating the step S28 to the step S29 for T times to obtain a final character feature vector set.
S211: inputting the character vector set corresponding to the finally obtained node into a conditional random field model CRF, and calculating to obtain an optimal label sequence, wherein the specific calculation formula comprises the following steps:
wherein i represents the ith node;
transpose for the final eigenvector of the ith node; l
i A label for the ith node;
and
to a label l
i-1 And l
i Trainable parameters of (a);
intermediate variables introduced for the clear representation of the formula, the sum
i-1 、l
i 、
Calculation formula related to three variables
And
the same meaning is applied;
in order to optimize the sequence of the tag,
a tag representing the ith node in the optimal tag sequence,
the same meaning is applied; y(s) is the set of all labels in the current situation s;
represents the best tag sequence under the current situation s as
The probability of (c).
For the training process, the loss function is:
wherein N is the total number of tag sequences contained in Y(s).
In this embodiment, if the data set containing the sequence shown in Fig. 2 is taken as the training set, the label dictionary is defined as:

tag2label = {B: 0, I: 1, O: 2}

Then, for this sequence, a label sequence is an arbitrary combination of the three labels over its 9 characters and has 3^9 possible values, i.e., Y(s) contains 3^9 elements in total.
For the test and decoding process, the optimal tag sequence y* is found by:

y* = argmax_{y∈Y(s)} p(y|s)

where p(y|s) denotes the probability of an arbitrary label sequence y under the current situation s; the optimal label sequence y* gives the labeling result corresponding to each input character.

In this embodiment, if the sequence shown in Fig. 2 is used as the test set with the above label dictionary, the completely correct optimal result is y* = (2, 2, 0, 1, 1, 1, 1, 1, 2).
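A minimal sketch showing how the "defect phenomenon" span can be read off such a label sequence; the index-to-label mapping follows tag2label above, and the character list is the assumed back-translated example sentence:

```python
label2tag = {0: "B", 1: "I", 2: "O"}          # inverse of tag2label = {B:0, I:1, O:2}
chars = list("保护装置运行异常。")              # assumed example sentence (9 characters)
y_star = [2, 2, 0, 1, 1, 1, 1, 1, 2]

def extract_entities(chars, label_ids):
    """Collect maximal B,I,... runs as extracted entity strings."""
    entities, current = [], []
    for ch, idx in zip(chars, label_ids):
        tag = label2tag[idx]
        if tag == "B":
            if current:
                entities.append("".join(current))
            current = [ch]
        elif tag == "I" and current:
            current.append(ch)
        else:
            if current:
                entities.append("".join(current))
            current = []
    if current:
        entities.append("".join(current))
    return entities

print(extract_entities(chars, y_star))   # -> ["装置运行异常"] ("device operating abnormally")
```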
S212: sequence y obtained according to S28 * And optimizing the model parameters by using an Adam optimizer, and circularly training for a certain number of times to obtain the CWG-TENER model with better effect for extracting the defect text information of the power equipment.
In this embodiment, the criterion of model performance is the performance of the model on the test set, specifically the three common named entity recognition indicators: precision, recall and F1 score.
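A minimal sketch of how these three indicators could be computed at the entity-span level, assuming entities are compared as exact (start, end) spans; this evaluation convention is an assumption, not specified by the invention:

```python
def span_prf(gold_spans, pred_spans):
    """Precision, recall and F1 over sets of (start, end) entity spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Dummy example: one gold entity, one predicted entity, exact match
print(span_prf(gold_spans=[(2, 7)], pred_spans=[(2, 7)]))   # -> (1.0, 1.0, 1.0)
```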
S3, inputting the power equipment defect text from which information is to be extracted into the power equipment defect text information extraction model to obtain the extracted information.
The information to be extracted may include: the defect phenomena, defect causes, solution measures and other information involved in the defect record texts of power secondary equipment.
The extracted information can be used to construct a subsequent knowledge graph, so that the solution can be queried when the secondary equipment of the power system fails, providing an auxiliary decision-making function. Compared with existing models, the present model obtains more complete and accurate entity information, so that the resulting auxiliary decision-making system is more practical.
In this embodiment, the target extraction information is defect phenomenon information, and the overall model architecture is shown in fig. 4.
Firstly, the data text is input into the pre-training language model BERT/BERT-wwm/ERNIE, the characters in the data text are converted into character vectors, and the words in the dictionary are converted into word vectors of the same dimension by the pre-training language model. The TENER model then performs a first feature extraction on the character vectors and word vectors to obtain the initial values with which they enter the "aggregation → update → aggregation → ..." cycle, and the initial value of the global vector is calculated at the same time. The character vectors, word vectors and global vector are aggregated to obtain the aggregation output. The character vectors and word vectors are then combined with the position encodings and input into a Transformer layer with N heads, i.e., the "update" layer, to obtain the update output, and the update output of the global vector is calculated at the same time. Finally, the character vectors, word vectors and global vector are input into the linear layer to obtain an output with the same dimension as the initial one, which is input into the aggregation layer again, and this is cycled T times. After the last aggregation operation, the final character feature vectors are input into the CRF layer to obtain the final label output.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such modifications are intended to be included in the scope of the present invention.