CN117634463A - Entity relationship joint extraction method, device, equipment and medium - Google Patents

Entity relationship joint extraction method, device, equipment and medium

Info

Publication number
CN117634463A
CN117634463A
Authority
CN
China
Prior art keywords
entity
text
relation
text data
power dispatching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311658221.3A
Other languages
Chinese (zh)
Inventor
张延斌
闫圣学
柯国富
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
GHT CO Ltd
Original Assignee
GHT CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by GHT CO Ltd filed Critical GHT CO Ltd
Priority to CN202311658221.3A priority Critical patent/CN117634463A/en
Publication of CN117634463A publication Critical patent/CN117634463A/en
Pending legal-status Critical Current

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Abstract

The invention discloses a method, a device, equipment and a medium for entity relationship joint extraction, wherein the method comprises the following steps: inputting the power dispatching text data to be processed into a pre-training language model that takes a neural embedding module as its input layer, so as to extract semantic features of the text data and obtain a plurality of text representation vectors; inputting the text representation vectors into a label prediction module to obtain entity labeling label probability distributions; determining the optimal entity labeling label of each text representation vector through a CRF module and performing word embedding to obtain optimal entity labeling label embedding vectors; splicing the entity labeling label probability distribution with the optimal entity labeling label embedding vector, and obtaining the target relationship between the optimal entity labeling labels of each text representation vector through a sigmoid function combined with an information matrix; wherein the neural embedding module is trained in advance with a power dispatching text data set generated by a text generation model. The invention can improve the accuracy of entity relationship extraction in the power dispatching field.

Description

Entity relationship joint extraction method, device, equipment and medium
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a method, an apparatus, a terminal device, and a computer readable storage medium for entity relationship joint extraction.
Background
Entity relationship joint extraction technology can acquire the relationships between key entities in a text, helping computers understand human intent, and plays a vital role in human-computer interaction applications. In the prior art, when facing the entity relationship joint extraction task in the power dispatching field, the dispatching text corpus is usually collected on site and produced through speech recognition and manual labeling; the resulting data volume is small and can hardly meet the requirements of language model training.
Disclosure of Invention
The invention provides a method, a device, equipment and a medium for entity relationship joint extraction, which expand a power dispatching text data set through a text generation model and extract semantic features of the power dispatching text data to be processed by using a neural embedding module, so that the applicability of the language model in the power dispatching field can be enhanced and the accuracy of entity relationship joint extraction for the power dispatching field is improved.
In order to solve the above technical problems, a first aspect of the present invention provides a method for entity relationship joint extraction, including the following steps:
inputting the power dispatching text data to be processed into a pre-training language model that takes a preset neural embedding module as its input layer, so as to extract semantic features of the power dispatching text data to be processed and obtain a plurality of text representation vectors;
inputting the plurality of text representation vectors into a preset label prediction module to obtain the entity labeling label probability distribution corresponding to each text representation vector;
according to the entity labeling label probability distribution, determining the optimal entity labeling label of each text representation vector through a CRF module using the Viterbi algorithm, under the constraint of preset dependency relationships among labeling labels, and performing word embedding on each optimal entity labeling label to obtain optimal entity labeling label embedding vectors;
splicing the entity labeling label probability distribution and the optimal entity labeling label embedding vector to form a relation extraction input quantity, and obtaining the target relation between the optimal entity labeling labels of each text representation vector through a sigmoid function according to the relation extraction input quantity and a preset information matrix;
wherein the information matrix comprises the relationships between different text representation vectors;
the neural embedding module is obtained by training a Transformer language model with a power dispatching text data set generated by a preset text generation model;
and the label prediction module and the CRF module are trained in advance with the power dispatching text data set after entity relationship labeling.
Preferably, the text generation model is a SeqGAN network;
the method specifically generates the power dispatching text data set by the following steps:
inputting the preset power dispatching text data into the SeqGAN network, obtaining generated power dispatching text data consistent with the preset power dispatching text data, and constructing the power dispatching text data set from the generated text data.
As a preferred scheme, the method specifically obtains the power dispatching text data set marked by the entity relation through the following steps:
labeling the power dispatching text data set with a BIO labeling strategy based on a preset entity set and a preset relation set to obtain a plurality of five-tuples; each five-tuple comprises the position of a character or word in the sentence, the character or word corresponding to the current five-tuple, the entity labeling label, the relation, and the subscript position of the relation word;
sequentially packaging the five-tuples corresponding to each character or word of each sentence in the power dispatching text data set into that sentence, and acquiring, based on the plurality of five-tuples in each packaged sentence, the text representation vectors corresponding to the characters or words in each five-tuple, the entity labeling label list corresponding to the entity labeling labels, and the relation matrix; the relation matrix is calculated based on the subscript position of the relation word in each five-tuple, the total length of the relation set, and the position index of the relation in the relation set;
and performing data padding on each sentence in the power dispatching text data set according to the dimension of the longest sentence in the data set, and forming the entity-relationship-labeled power dispatching text data set based on the text representation vectors corresponding to the characters or words in each five-tuple of each sentence, the entity labeling label list corresponding to the entity labeling labels, and the relation matrix.
Preferably, the method specifically acquires the information matrix by the following steps:
initializing a zero matrix as an initial information matrix according to the sentence length of the power dispatching text data set and the total length of the relation set;
and taking the transpose of each non-zero vector in the relation matrix as a column vector of the initial information matrix and filling it with 1, so as to obtain the information matrix.
As a preferred solution, the step of obtaining, according to the relation extraction input quantity and a preset information matrix, the target relation between the optimal entity labeling labels of each text representation vector through a sigmoid function specifically includes the following steps:
inputting the relation extraction input quantity into a linear neural network with ReLU as the activation function, and calculating the relation score corresponding to each text representation vector so as to obtain the predicted relation corresponding to each text representation vector;
and according to the relation scores and the information matrix, obtaining the target relation between the optimal entity labeling labels of the text representation vectors through a sigmoid function.
Preferably, the pre-training language model is specifically a BERT model.
Preferably, the label prediction module is specifically a BiLSTM model.
A second aspect of an embodiment of the present invention provides an entity-relationship joint extraction apparatus, including:
the text representation vector generation module is used for inputting the power dispatching text data to be processed into a pre-training language model that takes a preset neural embedding module as its input layer, so as to extract semantic features of the power dispatching text data to be processed and obtain a plurality of text representation vectors;
the entity labeling label probability distribution acquisition module is used for inputting the plurality of text representation vectors into a preset label prediction module to obtain the entity labeling label probability distribution corresponding to each text representation vector;
the optimal entity labeling label acquisition module is used for determining, according to the entity labeling label probability distribution, the optimal entity labeling label of each text representation vector through the CRF module using the Viterbi algorithm, under the constraint of preset dependency relationships among labeling labels, and performing word embedding on each optimal entity labeling label to obtain optimal entity labeling label embedding vectors;
the entity relation extraction module is used for splicing the entity labeling label probability distribution and the optimal entity labeling label embedding vector to form a relation extraction input quantity, and obtaining the target relation between the optimal entity labeling labels of each text representation vector through a sigmoid function according to the relation extraction input quantity and a preset information matrix;
wherein the information matrix comprises the relationships between different text representation vectors;
the neural embedding module is obtained by training a Transformer language model with a power dispatching text data set generated by a preset text generation model;
and the label prediction module and the CRF module are trained in advance with the power dispatching text data set after entity relationship labeling.
A third aspect of an embodiment of the present invention provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the entity relationship joint extraction method according to any one of the first aspects when the processor executes the computer program.
A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program, where the computer program when executed controls a device in which the computer-readable storage medium is located to execute the entity-relationship joint extraction method according to any one of the first aspects.
Compared with the prior art, the text generation model is used to expand the power dispatching text data set, and the neural embedding module is used to extract semantic features of the power dispatching text data to be processed, so that the applicability of the language model in the power dispatching field can be enhanced, and the accuracy of entity relationship joint extraction in the power dispatching field is improved.
Drawings
FIG. 1 is a flow chart of a method for entity relationship joint extraction in an embodiment of the invention;
FIG. 2 is a schematic diagram of an entity relationship joint extraction device according to an embodiment of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, a first aspect of the embodiment of the present invention provides a method for entity relationship joint extraction, including steps S1 to S4 as follows:
Step S1, inputting the power dispatching text data to be processed into a pre-training language model that takes a preset neural embedding module as its input layer, so as to extract semantic features of the power dispatching text data to be processed and obtain a plurality of text representation vectors;
Step S2, inputting the plurality of text representation vectors into a preset label prediction module to obtain the entity labeling label probability distribution corresponding to each text representation vector;
Step S3, according to the entity labeling label probability distribution, determining the optimal entity labeling label of each text representation vector through a CRF module using the Viterbi algorithm, under the constraint of preset dependency relationships among labeling labels, and performing word embedding on each optimal entity labeling label to obtain optimal entity labeling label embedding vectors;
Step S4, splicing the entity labeling label probability distribution and the optimal entity labeling label embedding vector to form a relation extraction input quantity, and obtaining the target relation between the optimal entity labeling labels of each text representation vector through a sigmoid function according to the relation extraction input quantity and a preset information matrix;
wherein the information matrix comprises the relationships between different text representation vectors;
the neural embedding module is obtained by training a Transformer language model with a power dispatching text data set generated by a preset text generation model;
and the label prediction module and the CRF module are trained in advance with the power dispatching text data set after entity relationship labeling.
Preferably, the pre-training language model is specifically a BERT model. BERT, short for Bidirectional Encoder Representations from Transformers, is an unsupervised pre-trained language model oriented to natural language processing tasks. BERT uses the Transformer encoder with masked multi-head attention as its feature extractor, together with the matching mask-based training method. The model transforms a piece of input text into a set of representation vectors, where each representation vector corresponds to a segmentation unit (character or word) of the input text and fuses the global information of the text. The general BERT model uses two pre-training objectives to learn text content features: the Masked Language Model (MLM) and Next Sentence Prediction (NSP). In the MLM task, certain positions in the input sequence are masked randomly and then predicted by the model, so that the model's encoding of each position also contains the surrounding context information. The NSP task predicts whether two sentences are adjacent, thereby learning features of the relationships between sentences. Since this embodiment processes single sentences in the power dispatching field, the relationships between sentences need not be considered, so the NSP training task is abandoned and only the MLM task is retained, optimizing the model's feature learning for single sentences. Furthermore, the embeddings of BERT integrate three kinds of input features into one feature vector: token (semantic) features, segment features, and position features.
In the fine-tuning stage of the model, considering that extracting the semantic features of the non-fixed instruction set in the power dispatching field in advance and feeding them into the model can improve its performance in this specific field, a neural embedding module with a converged Transformer as its reference network is introduced into the input layer of the BERT model so as to extract deeper semantic features of the input text. The neural embedding module is described below.
First, a small Transformer language model is trained in an unsupervised manner using the power dispatching text data set as training data. Then, for each text sample, starting from the original coding network model, only a few selected layers L_1, L_2, …, L_M are fine-tuned while all other layers are kept frozen. Fine-tuning starts with the new text data; once fine-tuning on the text sample is completed, for each previously selected layer L_j the new weight W'_j is measured against the original weight W_j, and the resulting vector is normalized. The neural embedding code of the text is obtained by concatenating the normalized vectors; with E_c denoting the concatenated vector, the formula is as follows:
E = E_c / |E_c|;
The vector obtained after concatenation and normalization is added into the BERT module as a new semantic feature, so that the coding performance of the general pre-training language model in the power dispatching field can be enhanced.
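The normalize-and-concatenate step above can be sketched in plain Python. The weight values below are hypothetical, and taking the per-layer difference W'_j − W_j as the measured quantity is an assumption about what "measuring the new weight against the original weight" means:

```python
import math

def normalize(vec):
    """L2-normalize a vector, as in E = E_c / |E_c|."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def neural_embedding(old_weights, new_weights):
    """Concatenate the normalized per-layer weight changes (assumed to be
    W'_j - W_j) of the fine-tuned layers into one embedding vector."""
    concatenated = []
    for w_old, w_new in zip(old_weights, new_weights):
        delta = [a - b for a, b in zip(w_new, w_old)]
        concatenated.extend(normalize(delta))
    return normalize(concatenated)  # final E = E_c / |E_c|

# Hypothetical weights of two fine-tuned layers, before and after tuning
W = [[0.2, 0.4], [0.1, 0.3]]
W_prime = [[0.5, 0.4], [0.1, 0.7]]
emb = neural_embedding(W, W_prime)
```

The resulting vector has unit norm, so it can be added alongside BERT's other embedding features at a comparable scale.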
Further, a plurality of text expression vectors are input into a preset label prediction module, and entity labeling label probability distribution corresponding to each text expression vector is obtained. Preferably, the label prediction module is specifically a BiLSTM model.
Specifically, LSTM is a recurrent neural network that can effectively handle long-term dependencies in sequence data such as text. Since a conventional LSTM only considers the context before the current time step, it cannot capture subsequent context. BiLSTM introduces a reverse network so that context in both directions is taken into account; through this bidirectional information extraction, BiLSTM can better capture the semantics and grammatical structure of natural language. Therefore, this embodiment uses a BiLSTM (bidirectional long short-term memory network) as the label prediction module: each token in the text sequence is classified to obtain its entity labeling label. The principle is that a fully connected layer of size [hidden_dim, num_labels] is attached after the hidden layer of the BiLSTM, yielding the probability of each label for each token.
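A minimal sketch of this fully connected prediction head for a single token follows; the hidden vector, weights, and label set are hypothetical stand-ins for the BiLSTM output and learned parameters:

```python
import math

def label_probabilities(hidden, fc_weights, fc_bias):
    """Project a BiLSTM hidden vector [hidden_dim] through a fully
    connected layer [hidden_dim, num_labels], then softmax the scores,
    yielding one probability per entity labeling label for this token."""
    num_labels = len(fc_bias)
    scores = [
        sum(h * fc_weights[i][j] for i, h in enumerate(hidden)) + fc_bias[j]
        for j in range(num_labels)
    ]
    m = max(scores)                       # stabilize the softmax
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 3-dim hidden state and 4 labels (e.g. B, I, O, PAD)
hidden = [0.5, -1.0, 0.25]
weights = [[0.1, 0.2, -0.3, 0.0],
           [0.4, -0.1, 0.2, 0.3],
           [-0.2, 0.5, 0.1, 0.1]]
bias = [0.0, 0.1, -0.1, 0.0]
probs = label_probabilities(hidden, weights, bias)
```

Running this per token produces exactly the per-token entity labeling label probability distribution that is later passed to the CRF.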
The CRF is used to solve the constraint problem in sequence labeling by learning the dependency relationships among labels; in essence, it produces a reasonable label prediction for the elements of a sentence. Taking the output of the BiLSTM's fully connected layer as the emission probabilities, the CRF learns the transition probabilities between labels and, via the Viterbi algorithm, constrains the label results predicted by the BiLSTM. For example, an I tag cannot directly follow an O tag. The CRF can thus learn more information from the training corpus and characterize more features. With the BIO labeling strategy, the CRF introduces the dependency relationships between labels. The effect of BiLSTM+CRF is: 1. the BiLSTM computes, for each token, the probability of each entity labeling label; 2. the Viterbi algorithm in the CRF yields the optimal entity labeling label for each token under the constraints, rather than simply the label with the highest BiLSTM output probability.
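The two-stage effect described above can be illustrated with a small pure-Python Viterbi decoder; the emission scores, transition scores, and the single BIO constraint (O may not be followed by I) are made up for the example:

```python
def viterbi(emissions, transitions, labels):
    """Decode the best label sequence given per-token emission scores
    (from the BiLSTM) and pairwise transition scores (learned by the CRF).
    Forbidden transitions carry a large negative score."""
    n_tokens, n_labels = len(emissions), len(labels)
    score = list(emissions[0])        # best score of a path ending in each label
    back = []                         # backpointers per time step
    for t in range(1, n_tokens):
        ptr, new_score = [], []
        for j in range(n_labels):
            best_i = max(range(n_labels),
                         key=lambda i: score[i] + transitions[i][j])
            ptr.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j]
                             + emissions[t][j])
        back.append(ptr)
        score = new_score
    best = max(range(n_labels), key=lambda j: score[j])
    path = [best]
    for ptr in reversed(back):        # follow backpointers to the start
        path.append(ptr[path[-1]])
    path.reverse()
    return [labels[j] for j in path]

labels = ["B", "I", "O"]
NEG = -1e4  # forbid O -> I: under BIO rules an I tag cannot follow O
transitions = [[0.5, 1.0, 0.2],   # from B
               [0.3, 0.8, 0.4],   # from I
               [0.6, NEG, 0.7]]   # from O
emissions = [[0.1, 0.2, 2.0],   # token 1: O scores highest
             [0.3, 1.5, 0.4],   # token 2: I scores highest, but O->I is banned
             [0.2, 0.1, 1.0]]
best_path = viterbi(emissions, transitions, labels)  # -> ["O", "O", "O"]
```

Per-token greedy decoding would pick the illegal sequence O, I, O here; the Viterbi search respects the transition constraint and returns a legal path instead.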
Preferably, the text generation model is a SeqGAN network;
the method specifically generates the power dispatching text data set by the following steps:
inputting the preset power dispatching text data into the SeqGAN network, obtaining generated power dispatching text data consistent with the preset power dispatching text data, and constructing the power dispatching text data set from the generated text data.
In particular, the SeqGAN network comprises a generator and a discriminator. G_θ is a text generator with parameter θ, trained to generate word sequences. The text generator uses ConvLSTM as its encoder and decoder framework, adopts a stochastic parameterized policy, and is updated by the policy gradient method. ConvLSTM (convolutional long short-term memory network), an improvement on LSTM, replaces the multiplication operations of the conventional LSTM with convolution operations, so it can describe local features, better handle sequential information such as text, and avoid the redundancy introduced by the conventional multiplication operations. Once trained, the generator G_θ can generate a word sequence Y = {y_1, y_2, …, y_t, …, y_n}, where every token y_i comes from the vocabulary. In addition, the text generator G_θ can be regarded as a reinforcement learning process: at time step t, the state s is the word sequence generated so far, Y_{1:t-1} = {y_1, y_2, …, y_{t-1}}, and the next action a is the selection of the next word y_t; its selection policy is G_θ(y_t | Y_{1:t-1}). Once the selection policy is determined, the action is determined and the word y_t is selected.
D_ψ is a discriminator trained under a cross-entropy loss; its task is to distinguish real dispatching text from the text generated by G_θ, and the discrimination result in turn directly guides the update of the generator. The update strategy is as follows: the probability that the discriminator judges a sample to be real is used as the reward; the generator is updated with the policy gradient method, and new tokens are obtained with a Monte Carlo tree search. The specific implementation scheme is described below:
generator G θ The training goal of (2) is the maximum expectation of rewards:
wherein R is T Is a reward of complete generation sequence, and is determined by a discriminator D ψ Calculating;is a function of the action value of the whole sequence, i.e. action a is taken in state s and according to policy G θ The expectation of the jackpot that results after the end of the sequence is performed. Distinguishing device D ψ The closer the output result of (2) is to 1, the representation generator G θ The closer the generated text is to the real schedule text, the closer to 0 indicates that the generated text is a false text. D (D) ψ Determining a probabilistic reward that generates samples as true:
calculating the reward is only meaningful when a sequence is completed to be generated. While for partial sequences Y which are not completely generated 1:t SeqGAN adopts a developing strategy G of Monte Carlo tree search β To obtain the remaining n-t tokens. Let us assume that a Monte Carlo tree search is performed N times, the entire search process can be expressed as:
wherein G is β And G θ Is the same model. Iterative update generator G by the above method θ Sum discriminator D ψ When the text generated by the generator can cheat the arbiter, namely, the automatically generated text sequence meets the requirement of the data set, according to the strategy, the automatic generation of the text oriented to power dispatching can be realized.
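A rough sketch of the Monte Carlo rollout reward for a partial sequence follows. The toy policy and discriminator are stand-ins for illustration only, not the patent's ConvLSTM generator or its trained discriminator:

```python
import random

def rollout_reward(prefix, vocab, policy, discriminator, seq_len, n_rollouts=8):
    """Estimate the reward of a partial sequence Y_{1:t}: complete it
    N times with the rollout policy G_beta and average the discriminator's
    probability that each completed sequence is a real dispatch sentence."""
    total = 0.0
    for _ in range(n_rollouts):
        seq = list(prefix)
        while len(seq) < seq_len:
            seq.append(policy(seq, vocab))   # sample next token from G_beta
        total += discriminator(seq)          # D_psi output in [0, 1]
    return total / n_rollouts

# Toy stand-ins (hypothetical): a uniform-random policy and a rule-based D_psi
random.seed(0)
vocab = ["close", "open", "breaker", "line", "#1"]
policy = lambda seq, vocab: random.choice(vocab)
discriminator = lambda seq: 1.0 if "breaker" in seq else 0.3
reward = rollout_reward(["close"], vocab, policy, discriminator, seq_len=4)
```

In training, this averaged score is the reward signal that the policy gradient uses to update G_θ token by token.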
As a preferred scheme, the method specifically obtains the power dispatching text data set marked by the entity relation through the following steps:
labeling the power dispatching text data set with a BIO labeling strategy based on a preset entity set and a preset relation set to obtain a plurality of five-tuples; each five-tuple comprises the position of a character or word in the sentence, the character or word corresponding to the current five-tuple, the entity labeling label, the relation, and the subscript position of the relation word;
sequentially packaging the five-tuples corresponding to each character or word of each sentence in the power dispatching text data set into that sentence, and acquiring, based on the plurality of five-tuples in each packaged sentence, the text representation vectors corresponding to the characters or words in each five-tuple, the entity labeling label list corresponding to the entity labeling labels, and the relation matrix; the relation matrix is calculated based on the subscript position of the relation word in each five-tuple, the total length of the relation set, and the position index of the relation in the relation set;
and performing data padding on each sentence in the power dispatching text data set according to the dimension of the longest sentence in the data set, and forming the entity-relationship-labeled power dispatching text data set based on the text representation vectors corresponding to the characters or words in each five-tuple of each sentence, the entity labeling label list corresponding to the entity labeling labels, and the relation matrix.
Specifically, this embodiment first creates a relation set Relations_set and an entity set Entity_set in advance. The power dispatching text data set is then labeled with a BIO labeling tool according to the entity set Entity_set using the BIO labeling strategy; assuming the sentence length is N, a text data set with N rows and 5 columns is obtained, where each row is a five-tuple:
<Token_id, Words, BIO, Relations, Relations_id>
where Token_id is the position of the character or word in the sentence, Words is the character or word itself, BIO is the entity labeling label, Relations is the relation, and Relations_id is the subscript position of the corresponding relation word (the position of the relation word in the sentence).
Then, all training data are traversed, and the five-tuples of the words in each sentence are packaged, as a list in word order, into that sentence. Each five-tuple in the packaged sentence is traversed: the text representation vector processed by the pre-training language model is obtained from Words, the entity labeling label list Entity_id is obtained from BIO, and the relation matrix Relations_matrix is obtained from Relations. The calculation of Relations_matrix is: Relations_id × len(Relations_set) + relations_index; where relations_index is the position index of the relation in the relation set, and len(Relations_set) is the total length of the relation set.
Further, data padding is performed on each sentence in the power dispatching text data set according to the dimension of the longest sentence in the data set; the purpose of padding with 0 is to equalize the dimensions of the training data within the same batch.
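A toy illustration of the quintuple packaging and padding steps follows; the sentences, labels, relation names, and pad value are all hypothetical:

```python
def pad_sentences(sentences, pad_quint=(0, "[PAD]", "O", "N", 0)):
    """Pad every sentence (a list of quintuples) to the length of the
    longest sentence so one batch shares a single tensor dimension."""
    max_len = max(len(s) for s in sentences)
    return [s + [pad_quint] * (max_len - len(s)) for s in sentences]

# Hypothetical dispatch sentences annotated with the
# <Token_id, Words, BIO, Relations, Relations_id> scheme
sent1 = [
    (0, "close",   "O",     "N",       0),
    (1, "breaker", "B-DEV", "operate", 0),
    (2, "#1",      "I-DEV", "operate", 0),
]
sent2 = [
    (0, "open", "O",     "N",       0),
    (1, "line", "B-DEV", "operate", 0),
]
batch = pad_sentences([sent1, sent2])  # both sentences now have length 3
```

After padding, every sentence in the batch has the dimension of the longest one, as the description above requires.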
Preferably, the method specifically acquires the information matrix by the following steps:
initializing a zero matrix as an initial information matrix according to the sentence length of the power dispatching text data set and the total length of the relation set;
and taking the transpose of each non-zero vector in the relation matrix as a column vector of the initial information matrix and filling it with 1, so as to obtain the information matrix.
Specifically, assuming the sentence length is N, a zero matrix of size [N, N × len(Relations_set)] is initialized as the initial information matrix; the relation matrix Relations_matrix is traversed, and the transpose of each non-zero vector in it is used as a column vector of the initial information matrix and filled with 1. The meaning is twofold: first, in the longitudinal direction, a position marked 1 indicates which word the current word has a direct relation with; second, in the lateral direction, it indicates which relation the word has.
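One plausible reading of this construction, sketched in pure Python: the matrix shape [N, N × len(Relations_set)] and the column encoding v = Relations_id × len(Relations_set) + relations_index follow the description above, while the concrete values are assumptions for illustration:

```python
def build_info_matrix(relations_matrix, n, n_rel):
    """Initialize a zero matrix of shape [N, N * len(Relations_set)];
    for every token i whose relation-matrix entry v is non-zero, mark
    column v of row i with 1, encoding (related position, relation type)."""
    info = [[0] * (n * n_rel) for _ in range(n)]
    for i, v in enumerate(relations_matrix):
        if v != 0:
            info[i][v] = 1
    return info

# Hypothetical sentence of 4 tokens and 3 relation types:
# token 1 relates to token 2 via relation index 1 -> v = 2 * 3 + 1 = 7
relations_matrix = [0, 7, 0, 0]
info = build_info_matrix(relations_matrix, n=4, n_rel=3)
```

Reading row 1 (longitudinal direction) shows which word token 1 relates to; the column index modulo the number of relations (lateral direction) recovers which relation it is.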
As a preferred solution, the step of obtaining, according to the relation extraction input quantity and a preset information matrix, the target relation between the optimal entity labeling labels of each text representation vector through a sigmoid function specifically includes the following steps:
inputting the relation extraction input quantity into a linear neural network with ReLU as the activation function, and calculating the relation score corresponding to each text representation vector so as to obtain the predicted relation corresponding to each text representation vector;
and according to the relation scores and the information matrix, obtaining the target relation between the optimal entity labeling labels of the text representation vectors through a sigmoid function.
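The ReLU-scored, sigmoid-thresholded relation prediction can be sketched as follows. The weights, the 0.5 threshold, and the masking with a single information-matrix row are illustrative assumptions:

```python
import math

def relation_scores(fused, weights, bias):
    """Linear layer with ReLU over the concatenated [label probability
    distribution ; label embedding] input, one score per relation type."""
    return [max(0.0,
                sum(x * weights[i][j] for i, x in enumerate(fused)) + bias[j])
            for j in range(len(bias))]

def predict_relations(scores, info_row, threshold=0.5):
    """Sigmoid the scores, keep only columns the information matrix marks
    as admissible, and threshold to get the target relations."""
    sigmoid = lambda s: 1.0 / (1.0 + math.exp(-s))
    return [j for j, (s, m) in enumerate(zip(scores, info_row))
            if m == 1 and sigmoid(s) > threshold]

# Hypothetical 4-dim fused input and 3 relation types
fused = [0.9, 0.1, 0.4, 0.6]
weights = [[1.0, -0.5, 0.2],
           [0.3, 0.4, -0.1],
           [-0.2, 0.6, 0.5],
           [0.5, 0.1, 0.3]]
bias = [0.0, -0.2, 0.1]
scores = relation_scores(fused, weights, bias)
info_row = [1, 0, 1]          # admissible relation columns for this token
targets = predict_relations(scores, info_row)  # -> [0, 2]
```

Relation 1 is rejected both by its ReLU score and by the information-matrix mask; relations 0 and 2 pass the sigmoid threshold and survive as the target relations.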
According to the entity relationship joint extraction method provided by the embodiment of the invention, the text generation model is used to expand the power dispatching text data set, and the neural embedding module is used to extract the semantic features of the power dispatching text data to be processed, so that the applicability of the language model in the power dispatching field can be enhanced, and the accuracy of entity relationship joint extraction in the power dispatching field is improved.
Referring to fig. 2, a second aspect of the embodiment of the present invention provides an entity relationship joint extraction apparatus, including:
the text representation vector generation module 201, configured to input the power dispatching text data to be processed into a pre-trained language model that uses a preset neural embedding module as its input layer, so as to extract semantic features of the power dispatching text data to be processed and obtain a plurality of text representation vectors;
the entity labeling tag probability distribution acquisition module 202, configured to input the plurality of text representation vectors into a preset tag prediction module, so as to obtain an entity labeling tag probability distribution corresponding to each text representation vector;
the optimal entity labeling tag acquisition module 203, configured to determine, according to the entity labeling tag probability distribution, the optimal entity labeling tag of each text representation vector under the constraint of the dependency relationships between preset labeling tags by using the Viterbi algorithm through a CRF module, and to perform word embedding on each optimal entity labeling tag to obtain an optimal entity labeling tag embedding vector;
the entity relation extraction module 204, configured to concatenate the entity labeling tag probability distribution and the optimal entity labeling tag embedding vector to form a relation extraction input quantity, and to obtain, according to the relation extraction input quantity and a preset information matrix, the target relation between the optimal entity labeling tags of each text representation vector through a sigmoid function;
wherein the information matrix comprises the relationships between different text representation vectors;
the neural embedding module is obtained by training a Transformer language model with a power dispatching text data set generated by a preset text generation model;
and the tag prediction module and the CRF module are trained with a power dispatching text data set marked with entity relationships in advance.
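To make the CRF decoding step described above concrete, here is a small, self-contained Viterbi sketch in plain Python. It is illustrative only: the tag set, emission scores, and the large negative transition score standing in for the tag-dependency constraints are assumptions, not values from the patent.

```python
def viterbi(emissions, transitions, tags):
    """Find the highest-scoring tag sequence given per-token emission scores
    (the tag prediction module's probability distribution) and pairwise
    transition scores (the CRF's tag-dependency constraints)."""
    n = len(emissions)
    # best[i][t]: best score of any path ending in tag t at position i.
    best = [{t: emissions[0][t] for t in tags}]
    back = []
    for i in range(1, n):
        scores, ptr = {}, {}
        for t in tags:
            prev, s = max(
                ((p, best[i - 1][p] + transitions[(p, t)]) for p in tags),
                key=lambda x: x[1])
            scores[t] = s + emissions[i][t]
            ptr[t] = prev
        best.append(scores)
        back.append(ptr)
    # Trace the best final tag back to the start.
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

tags = ["B", "I", "O"]
# Forbid the O -> I transition with a large negative score (a BIO constraint).
transitions = {(p, t): (-10.0 if (p == "O" and t == "I") else 0.0)
               for p in tags for t in tags}
emissions = [{"B": 0.7, "I": 0.1, "O": 0.2},
             {"B": 0.1, "I": 0.6, "O": 0.3},
             {"B": 0.1, "I": 0.2, "O": 0.7}]
print(viterbi(emissions, transitions, tags))  # -> ['B', 'I', 'O']
```

The transition table plays the role of the "dependency relationship between preset labeling tags": a sufficiently negative score makes an illegal tag bigram (such as O followed by I in a BIO scheme) effectively unreachable during decoding.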
It should be noted that the entity relationship joint extraction apparatus provided by the embodiment of the present invention can implement all the processes of the entity relationship joint extraction method described in any of the above embodiments; the functions and technical effects of each module in the apparatus are the same as those of the corresponding method and are not repeated here.
A third aspect of the embodiment of the present invention provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the entity relationship joint extraction method according to any embodiment of the first aspect when executing the computer program.
The terminal device may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The terminal device may include, but is not limited to, a processor and a memory, and may further include input and output devices, network access devices, buses, and the like.
The processor may be a central processing unit (Central Processing Unit, CPU), or may be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or any conventional processor; the processor is the control center of the terminal device and connects the various parts of the entire terminal device through various interfaces and lines.
The memory may be used to store the computer program and/or modules, and the processor implements the various functions of the terminal device by running or executing the computer program and/or modules stored in the memory and invoking the data stored in the memory. The memory may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required for at least one function (such as a sound playing function or an image playing function), and the data storage area may store data created according to the use of the terminal device (such as audio data or a phonebook). In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program, and when the computer program runs, the device in which the computer-readable storage medium is located is controlled to execute the entity relationship joint extraction method according to any one of the embodiments of the first aspect.
From the above description of the embodiments, it will be clear to those skilled in the art that the present invention may be implemented by means of software plus the necessary hardware platform, or, of course, entirely in hardware. Based on this understanding, the part of the technical solution of the present invention that contributes over the background art may be embodied in the form of a software product, which may be stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the method described in the embodiments, or in some parts of the embodiments, of the present invention.
While the foregoing is directed to the preferred embodiments of the present invention, it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of the invention; such changes and modifications are also intended to fall within the scope of the invention.

Claims (10)

1. An entity relationship joint extraction method, characterized by comprising the following steps:
inputting the power dispatching text data to be processed into a pre-trained language model that uses a preset neural embedding module as its input layer, so as to extract semantic features of the power dispatching text data to be processed and obtain a plurality of text representation vectors;
inputting the plurality of text representation vectors into a preset tag prediction module to obtain an entity labeling tag probability distribution corresponding to each text representation vector;
determining, according to the entity labeling tag probability distribution, the optimal entity labeling tag of each text representation vector under the constraint of the dependency relationships between preset labeling tags by using the Viterbi algorithm through a CRF module, and performing word embedding on each optimal entity labeling tag to obtain an optimal entity labeling tag embedding vector;
concatenating the entity labeling tag probability distribution and the optimal entity labeling tag embedding vector to form a relation extraction input quantity, and obtaining, according to the relation extraction input quantity and a preset information matrix, the target relation between the optimal entity labeling tags of each text representation vector through a sigmoid function;
wherein the information matrix comprises the relationships between different text representation vectors;
the neural embedding module is obtained by training a Transformer language model with a power dispatching text data set generated by a preset text generation model;
and the tag prediction module and the CRF module are trained with a power dispatching text data set marked with entity relationships in advance.
2. The entity relationship joint extraction method according to claim 1, wherein the text generation model is a SeqGAN network;
the power dispatching text data set is generated specifically by the following steps:
inputting preset power dispatching text data into the SeqGAN network to obtain generated power dispatching text data consistent with the power dispatching text data, and constructing the power dispatching text data set from the generated power dispatching text data.
3. The entity relationship joint extraction method according to claim 1, wherein the power dispatching text data set marked with entity relationships is obtained specifically by the following steps:
marking the power dispatching text data set with a BIO marking strategy based on a preset entity set and a preset relation set to obtain a plurality of five-tuples, where each five-tuple comprises the position of a character or word in a sentence, the character or word corresponding to the current five-tuple, an entity labeling tag, a relation, and the subscript position of the relation word;
sequentially packaging the five-tuples corresponding to each character or word in each sentence of the power dispatching text data set into the sentence, and acquiring, based on the plurality of five-tuples in each packaged sentence, the text representation vectors corresponding to the characters or words in each five-tuple, the entity labeling tag list corresponding to the entity labeling tags, and a relation matrix, where the relation matrix is calculated from the subscript position of the relation word in each five-tuple, the total length of the relation set, and the position index of the relation in the relation set;
and performing data padding on each sentence in the power dispatching text data set according to the dimension of the longest sentence in the power dispatching text data set, and forming the power dispatching text data set marked with entity relationships from the text representation vectors corresponding to the characters or words in each five-tuple of each sentence, the entity labeling tag list corresponding to the entity labeling tags, and the relation matrix.
4. The entity relationship joint extraction method according to claim 3, wherein the information matrix is obtained specifically by the following steps:
initializing a zero matrix as an initial information matrix according to the sentence length of the power dispatching text data set and the total length of the relation set;
and filling with 1 the entries of the initial information matrix indicated by the non-zero entries of the transposed relation matrix, so as to obtain the information matrix.
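As an illustrative sketch (not part of the claims), the information-matrix construction described above might look like the following in plain Python; the matrix shapes (relation-set length by sentence length for the relation matrix) and the function name are assumptions for illustration.

```python
def build_information_matrix(relation_matrix, sentence_len, n_relations):
    """Initialize a zero matrix of shape (sentence_len, n_relations), then set
    to 1 every entry whose counterpart in the transposed relation matrix is
    non-zero. Shapes and indexing here are assumptions, not from the patent."""
    info = [[0] * n_relations for _ in range(sentence_len)]
    # relation_matrix is assumed to be n_relations x sentence_len here,
    # so its transpose aligns with the information matrix.
    for r in range(n_relations):
        for pos in range(sentence_len):
            if relation_matrix[r][pos] != 0:
                info[pos][r] = 1
    return info

# 2 relation types over a 3-token sentence (toy, non-zero values arbitrary).
rel = [[0, 3, 0],
       [0, 0, 5]]
print(build_information_matrix(rel, sentence_len=3, n_relations=2))
```

The resulting binary matrix can then serve as the mask that restricts which (token, relation) pairs the sigmoid stage is allowed to output.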
5. The entity relationship joint extraction method according to claim 1, wherein obtaining the target relation between the optimal entity labeling tags of each text representation vector through a sigmoid function according to the relation extraction input quantity and a preset information matrix specifically comprises the following steps:
inputting the relation extraction input quantity into a linear neural network with ReLU as the activation function, and calculating a relation score corresponding to each text representation vector, so as to obtain a predicted relation corresponding to each text representation vector;
and obtaining, through a sigmoid function and according to the relation score and the information matrix, the target relation between the optimal entity labeling tags of the text representation vectors.
6. The entity relationship joint extraction method according to claim 1, wherein the pre-trained language model is specifically a BERT model.
7. The entity relationship joint extraction method according to claim 1, wherein the tag prediction module is a BiLSTM model.
8. An entity relationship joint extraction apparatus, characterized by comprising:
a text representation vector generation module, configured to input the power dispatching text data to be processed into a pre-trained language model that uses a preset neural embedding module as its input layer, so as to extract semantic features of the power dispatching text data to be processed and obtain a plurality of text representation vectors;
an entity labeling tag probability distribution acquisition module, configured to input the plurality of text representation vectors into a preset tag prediction module to obtain an entity labeling tag probability distribution corresponding to each text representation vector;
an optimal entity labeling tag acquisition module, configured to determine, according to the entity labeling tag probability distribution, the optimal entity labeling tag of each text representation vector under the constraint of the dependency relationships between preset labeling tags by using the Viterbi algorithm through a CRF module, and to perform word embedding on each optimal entity labeling tag to obtain an optimal entity labeling tag embedding vector;
an entity relation extraction module, configured to concatenate the entity labeling tag probability distribution and the optimal entity labeling tag embedding vector to form a relation extraction input quantity, and to obtain, according to the relation extraction input quantity and a preset information matrix, the target relation between the optimal entity labeling tags of each text representation vector through a sigmoid function;
wherein the information matrix comprises the relationships between different text representation vectors;
the neural embedding module is obtained by training a Transformer language model with a power dispatching text data set generated by a preset text generation model;
and the tag prediction module and the CRF module are trained with a power dispatching text data set marked with entity relationships in advance.
9. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the entity relationship joint extraction method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored computer program, wherein, when the computer program runs, the device in which the computer-readable storage medium is located is controlled to perform the entity relationship joint extraction method according to any one of claims 1 to 7.
CN202311658221.3A 2023-12-05 2023-12-05 Entity relationship joint extraction method, device, equipment and medium Pending CN117634463A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311658221.3A CN117634463A (en) 2023-12-05 2023-12-05 Entity relationship joint extraction method, device, equipment and medium


Publications (1)

Publication Number Publication Date
CN117634463A 2024-03-01

Family

ID=90035445

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311658221.3A Pending CN117634463A (en) 2023-12-05 2023-12-05 Entity relationship joint extraction method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN117634463A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination