CN115935983A - Event extraction method and device, electronic equipment and storage medium - Google Patents


Publication number
CN115935983A
Authority
CN
China
Prior art keywords: text, event, processed, label, word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211717646.2A
Other languages
Chinese (zh)
Inventor
李晓平
顾文斌
杨祎聪
李松柏
孙勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Hengsheng Juyuan Data Service Co ltd
Hangzhou Hengsheng Juyuan Information Technology Co ltd
Original Assignee
Shanghai Hengsheng Juyuan Data Service Co ltd
Hangzhou Hengsheng Juyuan Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Hengsheng Juyuan Data Service Co ltd, Hangzhou Hengsheng Juyuan Information Technology Co ltd filed Critical Shanghai Hengsheng Juyuan Data Service Co ltd
Priority to CN202211717646.2A
Publication of CN115935983A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application relate to the field of natural language processing and provide an event extraction method, an event extraction device, an electronic device, and a storage medium. When event extraction is performed on a text to be processed, a text classification model is first used to perform event primary classification on the text to be processed, obtaining a prediction category label and text heat information of the text. Then, according to the subject information of each event subject in the text and in combination with the text heat information, a target event subject matched with the prediction category label is found among all event subjects. Next, because the prediction category label is obtained by clustering at least one event type, event secondary classification is performed on the text using a key feature word bank, and a target event type is restored from the prediction category label, yielding the event label of the text to be processed. Event extraction can therefore be achieved without extracting trigger words, improving event extraction efficiency.

Description

Event extraction method and device, electronic equipment and storage medium
Technical Field
The embodiment of the application relates to the field of natural language processing, in particular to an event extraction method and device, electronic equipment and a storage medium.
Background
Event extraction refers to extracting structured event information from a text containing such information. Generally, event information includes an event type and event elements, where the event elements include information such as the event main body. Event extraction has practical significance in many fields, such as information retrieval.
In the early stage, event extraction generally adopted pattern matching, extracting event information from text based on syntax trees or regular expressions. In recent years, with the development of machine learning and deep learning, event extraction using statistical models and deep learning models has become the mainstream of research. The latter can be further divided into pipeline extraction and joint extraction according to how the tasks are arranged. Pipeline extraction consists of multiple independent subtasks, of which the core link is trigger-word extraction; trigger words are complicated to extract and difficult to enumerate exhaustively. Joint extraction processes the links of trigger-word detection, event-type judgment, event-element extraction, and so on uniformly, and can exploit the mutual influence among these links, but its model complexity is higher.
Therefore, the conventional event extraction methods are complicated and consequently inefficient.
Disclosure of Invention
An object of the embodiments of the present application is to provide an event extraction method, an event extraction device, an electronic device, and a storage medium that can complete event extraction of a text without extracting trigger words, thereby improving event extraction efficiency.
In order to achieve the above purpose, the embodiments of the present application employ the following technical solutions:
in a first aspect, an embodiment of the present application provides an event extraction method, where the method includes:
acquiring a text to be processed and each event main body and main body information thereof in the text to be processed;
performing event primary classification on the text to be processed by using a pre-trained text classification model to obtain a prediction category label and text heat information of the text to be processed, wherein the prediction category label is obtained by clustering at least one event type;
obtaining a target event main body matched with the prediction category label according to the main body information of each event main body and the text heat information;
and performing event secondary classification on the text to be processed by utilizing a pre-established key feature word bank, and restoring a target event type from the prediction category label to obtain an event label of the text to be processed, wherein the event label comprises the target event main body and the target event type.
Optionally, the text classification model includes a Bert model and a multi-label classifier, the multi-label classifier including a plurality of category labels;
the step of performing event primary classification on the text to be processed by using a pre-trained text classification model to obtain a prediction category label and text heat information of the text to be processed comprises the following steps:
inputting the text to be processed into the text classification model, and obtaining an embedding sequence of the text to be processed by using the Bert model, wherein the embedding sequence comprises word embedding of a set CLS symbol and word embedding of each word in the text to be processed;
learning semantic information of the text to be processed based on an attention mechanism by using the Bert model, and obtaining an attention matrix corresponding to the text to be processed and an output vector of the CLS symbol; wherein the attention matrix represents the similarity relation between the CLS symbol and each word in the text to be processed;
classifying the output vector by using the multi-label classifier to obtain a probability value of each class label, and taking the class label with the probability value higher than a set threshold value as the prediction class label;
and performing linear transformation on the attention matrix by using the multi-label classifier to obtain the text heat information, wherein the text heat information represents the relevance between the CLS symbol under the prediction class label and each word in the text to be processed.
Optionally, the subject information includes location information;
the step of obtaining a target event subject matched with the prediction category label according to the subject information of each event subject and the text popularity information includes:
according to the position information of each event main body, the text to be processed is divided into sentences to obtain a text unit corresponding to each event main body;
calculating the text heat corresponding to each text unit according to the text heat information;
and taking the event main body corresponding to the text unit with the highest text popularity as the target event main body.
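The three steps above (one text unit per event main body, summing the heat within each unit, then taking the arg-max) can be sketched in Python. The data layout — (subject, token-span) pairs plus a per-token heat list taken from the CLS attention row — is a hypothetical simplification, not fixed by this application:

```python
def pick_target_subject(units, heat):
    """Return the event main body whose text unit has the highest text heat.

    units: list of (subject, (start, end)) pairs, one per event main body,
           where (start, end) is the token span of its sentence unit
    heat:  per-token relevance scores (hypothetical layout, e.g. from the
           CLS attention row after linear transformation)
    """
    best_subject, best_heat = None, float("-inf")
    for subject, (start, end) in units:
        unit_heat = sum(heat[start:end])  # text heat of this text unit
        if unit_heat > best_heat:
            best_subject, best_heat = subject, unit_heat
    return best_subject
```

For example, with two subjects whose units span tokens 0-2 and 3-5 and heat scores [0.1, 0.2, 0.1, 0.5, 0.4, 0.2], the second subject is selected because its unit accumulates more heat.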
Optionally, the text classification model includes a plurality of category labels, and each category label is obtained by clustering at least one event type; the key characteristic word library comprises a plurality of key characteristic words corresponding to each event type and the weight of each key characteristic word;
the step of performing event secondary classification on the text to be processed by using a pre-established key feature word library and restoring a target event type from the prediction category label comprises the following steps:
performing word segmentation on the text to be processed to obtain a plurality of reference words;
for each event type under the prediction category label, determining each target key feature word of the event type from the plurality of reference words based on the key feature word library;
obtaining the weight of each target key feature word and summing the weights to obtain the weight of the event type;
and taking the event type with the highest weight as the target event type.
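The secondary classification above reduces to a weighted keyword vote per event type under the predicted label; a minimal sketch follows, where the `{event_type: {keyword: weight}}` layout of the key feature word library is an assumption for illustration:

```python
def restore_event_type(reference_words, candidate_types, keyword_bank):
    """Restore the target event type from the prediction category label.

    candidate_types: the event types clustered under the predicted label
    keyword_bank:    {event_type: {key feature word: weight}} (assumed layout)
    For each candidate type, the weights of its key feature words that occur
    among the segmented reference words are summed; the type with the
    highest total weight is returned as the target event type.
    """
    words = set(reference_words)
    scores = {
        t: sum(w for kw, w in keyword_bank.get(t, {}).items() if kw in words)
        for t in candidate_types
    }
    return max(scores, key=scores.get)
```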
Optionally, the step of obtaining the text to be processed and each event subject and subject information thereof in the text to be processed includes:
acquiring an original text;
generating an abstract of the original text through an automatic abstract model to obtain the text to be processed;
and carrying out entity recognition on the text to be processed through an entity recognition model to obtain each event main body and main body information thereof in the text to be processed.
Optionally, the text classification model is trained by:
acquiring a supervised corpus, wherein the supervised corpus comprises a plurality of training samples and an event type of each training sample;
clustering all event types to obtain a plurality of category labels, wherein the category labels comprise at least one event type;
and training the text classification model by using the training samples and the class labels to obtain the trained text classification model.
Optionally, the clustering all event types to obtain a plurality of category labels includes:
converting the text of each training sample into vectors through a pre-trained word embedding model to obtain each piece of word embedding information;
dividing the word embedding information with the same event type into a group, and taking the average value of all the word embedding information in the group as the feature vector of the event type to obtain the feature vector of each event type;
calculating the correlation of every two event types according to the feature vector of each event type;
and performing hierarchical clustering on all event types according to the correlation of every two event types to obtain the plurality of category labels.
Optionally, the text classification model comprises a Bert model and a multi-label classifier, the multi-label classifier comprising the plurality of category labels;
the step of training the text classification model by using the training samples and the class labels to obtain a trained text classification model includes:
inputting the training samples and the class labels into the text classification model, and obtaining a sample embedding sequence of the training samples by using the Bert model, wherein the sample embedding sequence comprises word embedding for setting a CLS symbol and word embedding of each word in the training samples;
learning semantic information of the sample embedding sequence by using the Bert model based on an attention mechanism to obtain an output vector of the CLS symbol;
classifying the output vector of the CLS symbol by using the multi-label classifier to obtain a prediction class label of the training sample;
and training the text classification model based on the class label and the prediction class label of each training sample and a preset loss function to obtain the trained text classification model.
Optionally, the loss function is:
L_total(x_k, y_k) = [1 + γ(1 − F1_body(x_k, u_k))] · L_DB(x_k, y_k)
where L_total represents the loss function, k indexes the training samples, x represents a training sample, y represents the class label of the training sample, γ represents the coefficient of the event subject loss, F1_body indicates the event subject accuracy, and L_DB represents the classification loss function;
the event subject accuracy is:
F1_body(x_k, u_k) = (1/C) Σ_{i=1}^{C} 2·TP_ki / (2·TP_ki + FP_ki + FN_ki)
where C represents the total number of class labels, i indexes the class labels of the multi-label classifier, and TP_ki, FP_ki, and FN_ki represent the confusion-matrix indexes of the event subject classification result under the ith class label of the kth training sample;
the classification loss function is:
L_DB(x_k, y_k) = (1/C) Σ_{i=1}^{C} r̂_ki · [ y_ki · log(1 + e^{−(z_ki − v_i)}) + (1/λ) · (1 − y_ki) · log(1 + e^{λ(z_ki − v_i)}) ]
where r̂_ki represents the smoothed weight of the ith class label of the kth training sample, z_ki represents the predicted logit of the kth training sample on the ith class label, λ is a hyperparameter influencing the loss weight of negative samples, and v_i is the weight bias for the ith class label.
Optionally, the keyword library is built by:
performing word segmentation on the supervised corpus and removing stop words to obtain a word segmentation result of each training sample;
based on the word segmentation result of each training sample, removing the high-frequency public words of the training sample corresponding to each class label;
screening out a special high-frequency word of a training sample corresponding to the event type aiming at each event type under any one category label to obtain each key feature word of the event type;
for each key feature word of any event type under the category label, obtaining the weight of the key feature word according to the word frequency of the key feature word in the supervised corpus and the sample number of training samples corresponding to the event type;
and obtaining the weight of each key characteristic word of each event type under each category label to obtain the key characteristic word library.
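The library-building steps above can be sketched as follows. This is a simplified reading: "removing high-frequency public words" is approximated by keeping only words absent from every sibling event type, and the weight formula (term frequency divided by the type's sample count) is an assumption consistent with, but not dictated by, the description:

```python
from collections import Counter

def build_keyword_bank(samples_by_type, stopwords, top_k=20):
    """Build a key feature word bank per event type (assumed formulas).

    samples_by_type: {event_type: [tokenized training samples]}
    """
    # term frequency per event type, with stop words removed
    tf = {t: Counter(w for sample in samples for w in sample if w not in stopwords)
          for t, samples in samples_by_type.items()}
    bank = {}
    for t, counts in tf.items():
        # words that also occur under any other event type are treated as
        # "public" and dropped; only type-specific high-frequency words remain
        other_words = set().union(*(tf[o].keys() for o in tf if o != t))
        distinctive = {w: c for w, c in counts.items() if w not in other_words}
        n = len(samples_by_type[t])
        top = sorted(distinctive.items(), key=lambda kv: -kv[1])[:top_k]
        bank[t] = {w: c / n for w, c in top}  # weight = term freq / sample count
    return bank
```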
In a second aspect, an embodiment of the present application further provides an event extraction apparatus, where the apparatus includes:
the system comprises an obtaining module, a processing module and a processing module, wherein the obtaining module is used for obtaining a text to be processed and each event main body and main body information thereof in the text to be processed;
the event primary classification module is used for carrying out event primary classification on the text to be processed by utilizing a pre-trained text classification model to obtain a prediction category label and text heat information of the text to be processed, wherein the prediction category label is obtained by clustering at least one event type;
the event main body matching module is used for obtaining a target event main body matched with the prediction category label according to the main body information of each event main body and the text heat information;
and the event secondary classification module is used for carrying out event secondary classification on the text to be processed by utilizing a pre-established key feature word library, restoring a target event type from the prediction category label, and obtaining an event label of the text to be processed, wherein the event label comprises the target event main body and the target event type.
In a third aspect, an embodiment of the present application further provides an electronic device, which includes a processor and a memory, where the memory is used to store a program, and the processor is used to implement the event extraction method in the first aspect when executing the program.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the event extraction method in the first aspect.
Compared with the prior art, in the event extraction method, the event extraction device, the electronic device, and the storage medium provided by the embodiments of the present application, when event extraction is performed on a text to be processed, a text classification model is first used to perform event primary classification on the text, obtaining a prediction category label and text heat information of the text; then, according to the main body information of each event main body in the text, combined with the text heat information, a target event main body matched with the prediction category label is found among all event main bodies; next, since the prediction category label is obtained by clustering at least one event type, event secondary classification is performed on the text using the key feature word bank, and a target event type is restored from the prediction category label to obtain the event label of the text to be processed. Event extraction can thus be achieved without extracting trigger words, improving event extraction efficiency.
Drawings
Fig. 1 illustrates a clustering example diagram of event types provided in an embodiment of the present application.
Fig. 2 shows a schematic structural diagram of a text classification model provided in an embodiment of the present application.
Fig. 3 shows a flowchart of an event extraction method provided in an embodiment of the present application.
Fig. 4 shows a block schematic diagram of an event extraction device according to an embodiment of the present application.
Fig. 5 shows a block schematic diagram of an electronic device provided in an embodiment of the present application.
An icon: 100-event extraction means; 101-obtaining a module; 102-event primary classification module; 103-event subject matching module; 104-event secondary classification module; 10-an electronic device; 11-a processor; 12-a memory; 13-bus.
Detailed Description
The technical solution in the embodiments of the present application will be described clearly and completely with reference to the drawings in the embodiments of the present application.
The traditional pipeline extraction consists of multiple independent subtasks, mainly including trigger-word detection, event-type judgment, event-element extraction, and the like; the core link is trigger-word extraction, and trigger words are complicated to extract and difficult to enumerate exhaustively. Joint extraction processes the links of trigger-word detection, event-type judgment, event-element extraction, and so on uniformly, and can exploit the interactions among these links. In contrast, however, the complexity of the joint extraction model is higher, and its actual effect is not necessarily better than that of pipeline extraction.
To solve the above problems, in the embodiments of the present application, when event extraction is performed on a text to be processed, a text classification model is first used to perform event primary classification on the text, obtaining a prediction category label and text heat information of the text; then, according to the main body information of each event main body in the text, combined with the text heat information, a target event main body matched with the prediction category label is found among all event main bodies; then, since the prediction category label is obtained by clustering at least one event type, event secondary classification is performed on the text using the key feature word bank, and the target event type is restored from the prediction category label to obtain the event label of the text to be processed. Event extraction can thus be achieved without extracting trigger words, improving event extraction efficiency. This is described in detail below.
The text for extracting the event in the embodiment of the present application may be various texts such as news, blogs, treatises, electronic medical records, and the like, which is not limited in any way in the embodiment of the present application, and the following embodiment takes a news text as an example for description.
For convenience of understanding, before describing specific implementations of the embodiments of the present application, a training process of a text classification model and a process of establishing a keyword library are described.
First, a training process of the text classification model is introduced, which may include the following steps:
s1, obtaining a supervised corpus, wherein the supervised corpus comprises a plurality of training samples and an event type of each training sample.
In this embodiment, the training samples may be news texts from the Internet, each including a news headline and news body. The event type may be manually labeled for each training sample and may be a classification of the news events involved in the sample, such as executive departure.
In practical applications, some news texts may be relatively long, and if they are used directly as training samples, model training efficiency may suffer. Therefore, before model training, each news text can be summarized through an automatic abstract model to control the overall text length, so that a training sample comprising the news headline and a text abstract is obtained.
Optionally, the automatic summarization model may be a model that can automatically generate a summary in the prior art, for example, seq2seq, etc., and this is not limited in this application.
In this embodiment, before labeling event types for the training samples, an event system may be established according to business attributes. For example, for business related to an enterprise, an event system may be built around the enterprise's business, covering events related to personnel, finance, administration, and the like; the training samples are then labeled according to this event system. It should be noted that the event system can be flexibly established by the user according to the actual business, which is not limited by the embodiments of the present application.
And S2, clustering all event types to obtain a plurality of category labels, wherein the category labels comprise at least one event type.
In practical applications, event classification may be fairly fine-grained. For example, the event "executive departure" may be subdivided, according to actual requirements and the level of the executive involved, into "leading figure departure", "core executive departure", and "important executive departure". However, for news related to the event types "executive departure", "leading figure departure", "core executive departure", and "important executive departure", the general text descriptions are usually substantially consistent, so these types are difficult to distinguish.
For such text classification scenarios with a low degree of distinction among event types, in order to improve the accuracy of text classification, similar event types can be clustered into one category label; this both avoids the difficulty of distinguishing them and improves the efficiency of event extraction.
Therefore, after the supervised corpus is obtained, the event type of each training sample can be clustered to obtain a plurality of category labels, and then the plurality of category labels are utilized to classify texts.
Optionally, the process of clustering all event types in step S2 to obtain a plurality of category labels may include S21 to S24.
And S21, converting the text of each training sample into vectors through the pre-trained word embedding model to obtain each piece of word embedding information.
In this embodiment, the pre-trained word embedding model may be any prior-art model capable of converting text into vectors, for example, Word2Vec, GloVe, and the like, which is not limited in this embodiment.
And S22, dividing the word embedding information with the same event type into a group, and taking the average value of all the word embedding information in the group as the feature vector of the event type to obtain the feature vector of each event type.
In this embodiment, each word embedding information is grouped according to event type, word embedding information having the same event type (e.g., core high management career) is divided into one group, and an average value of all word embedding information in the group is calculated as a feature vector of the event type, so that a feature vector of each event type can be obtained.
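Step S22 amounts to an element-wise mean over each type's embedding vectors; a minimal sketch (the dict-of-lists layout is a hypothetical simplification):

```python
def type_feature_vectors(embeddings_by_type):
    """S22 sketch: the feature vector of an event type is the element-wise
    mean of the word-embedding vectors grouped under that type.

    embeddings_by_type: {event_type: [embedding vectors of its samples]}
    """
    return {
        event_type: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for event_type, vectors in embeddings_by_type.items()
    }
```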
And S23, calculating the correlation between every two event types according to the feature vector of each event type.
In this embodiment, for any two event types, the cosine similarity between the feature vectors of the two event types is calculated as the correlation between the two event types, and so on, and the correlation between every two event types is calculated.
And S24, performing hierarchical clustering on all event types according to the correlation of every two event types to obtain a plurality of category labels.
In this embodiment, hierarchical clustering is performed according to the correlation between event types. Initially, each event type is its own class. According to the correlations calculated in S23, the closest classes are merged to obtain a new class, and the average of the feature vectors of all event types in the new class is used as the feature vector of the new class; the feature vector of each new class can thus be obtained. The new classes are then further combined according to the processes of S23 to S24, and this is repeated until the merged result achieves the optimal classification effect, yielding the plurality of category labels.
For example, referring to fig. 1, assume there are 8 event types: leading figure departure, core executive departure, important executive departure, general executive violation, important executive violation, violating reduction of holdings, general reduction of holdings, and shareholder commitment not to reduce holdings. First, according to the process of S23 to S24, leading figure departure, core executive departure, and important executive departure are combined into virtual node 1; general executive violation and important executive violation are combined into virtual node 2; violating reduction of holdings and general reduction of holdings are combined into virtual node 3; and shareholder commitment not to reduce holdings remains its own class. Then, further according to the processes of S23 to S24, virtual node 1 and virtual node 2 are combined into category label 1, and virtual node 3 and shareholder commitment not to reduce holdings are combined into category label 2.
It should be noted that the combined result achieves the optimal classification effect, the evaluation needs to be performed by comprehensively considering the training effect of the text classification model and the secondary event classification effect, and meanwhile, the evaluation needs to be determined by combining the experience of the user on the actual service, which is not limited in the embodiment of the present application.
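The S23-S24 loop can be sketched as cosine-similarity agglomerative clustering. One simplification to note: the stopping rule here is a fixed cluster count, standing in for the "optimal classification effect" criterion, which the application leaves to evaluation and user experience:

```python
def cosine(u, v):
    """Cosine similarity, used as the correlation between feature vectors (S23)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return dot / norm

def cluster_event_types(features, n_labels):
    """S23-S24 sketch: agglomerative clustering of event-type feature vectors.

    features: {event_type: feature vector} (e.g. the S22 averages)
    Repeatedly merges the two most correlated clusters until n_labels
    remain; a merged cluster's vector is the mean over its member types.
    """
    clusters = [([t], v) for t, v in features.items()]  # (member types, mean vector)
    while len(clusters) > n_labels:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = max(pairs, key=lambda p: cosine(clusters[p[0]][1], clusters[p[1]][1]))
        (ma, va), (mb, vb) = clusters[i], clusters[j]
        na, nb = len(ma), len(mb)
        # size-weighted mean keeps the vector equal to the mean over all members
        mean = [(a * na + b * nb) / (na + nb) for a, b in zip(va, vb)]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [(ma + mb, mean)]
    return [sorted(members) for members, _ in clusters]
```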
And S3, training the text classification model by using the plurality of training samples and the plurality of class labels to obtain the trained text classification model.
In this embodiment, after the plurality of class labels are obtained in step S2, each training sample has one class label, and at this time, the text classification model may be trained by using the plurality of training samples and the class label of each training sample.
Referring to fig. 2, the text classification model may include a Bert model and a multi-label classifier, and the multi-label classifier corresponds to the plurality of class labels obtained in step S2, so that the process of training the text classification model by using the plurality of training samples and the plurality of class labels in step S3 to obtain the trained text classification model may include S31 to S34.
And S31, inputting a plurality of training samples and a plurality of class labels into the text classification model, and obtaining a sample embedding sequence of the training samples by using the Bert model, wherein the sample embedding sequence comprises word embedding for setting CLS symbols and word embedding of each word in the training samples.
In this embodiment, for the text classification task, the Bert model inserts a CLS symbol in front of the text, and uses an output vector corresponding to the symbol as a semantic representation of the whole text for text classification.
Then, the training sample with the CLS symbol inserted is converted into vectors to obtain the word embedding of the CLS symbol and the word embedding of each word in the training sample.
And S32, learning semantic information of the sample embedded sequence based on the attention mechanism by using the Bert model, and obtaining an output vector of the CLS symbol.
In this embodiment, the Bert model is built by stacking multiple Transformer layers, and the attention mechanism is the most critical part of the Transformer, so it is described with emphasis. The main function of the attention mechanism is to let the model put "attention" on part of the input, i.e., to distinguish the effect of different parts of the input on the output. Applied to the text classification task, this means enhancing the semantic representations of words: the context of a word helps enhance its semantic representation, but different context words contribute differently to this enhancement. To exploit context information discriminatively when enhancing the semantic representation of a target word, the attention mechanism can be used.
The attention mechanism mainly involves three concepts: Query, Key, and Value. In the scenario of enhancing the semantic representations of words, the target word and each word of its context have their own original Value. The attention mechanism takes the target word as the Query and each word of its context as a Key, uses the similarity between the Query and each Key as a weight, and fuses the Values of the context words into the original Value of the target word.
That is, the attention mechanism takes the semantic vector representations of the target word and each context word as input. Through linear transformations, it first obtains the Query vector representation of the target word, the Key vector representation of each context word, and the original Value representations of the target word and each context word. It then computes the similarity between the Query vector and each Key vector as weights, and fuses, by weighted sum, the Value vector of the target word with the Value vectors of the context words as the output, namely the enhanced semantic vector representation of the target word.
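The Query/Key/Value computation above, for a single target word, can be sketched as scaled dot-product attention (the linear transformations are assumed to have been applied already, so the function takes the Q/K/V vectors directly):

```python
import math

def enhanced_representation(query, keys, values):
    """Single-word sketch of the Query/Key/Value attention described above.

    query:  Query vector of the target word
    keys:   Key vectors of the target word and its context words
    values: Value vectors aligned with the keys
    """
    d = len(query)
    # similarity of the Query to each Key, scaled by sqrt(d)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    peak = max(scores)
    exp_scores = [math.exp(s - peak) for s in scores]  # numerically stable softmax
    total = sum(exp_scores)
    weights = [e / total for e in exp_scores]
    # weighted fusion of the Value vectors = enhanced semantic representation
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
```

With a Query that matches the first Key more closely than the second, the output is pulled toward the first Value vector, which is exactly the "distinguish the effect of different parts of the input" behavior described above.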
In this embodiment, after word embedding is completed, the Bert model learns the semantic information of each position in the text by using the attention mechanism and obtains an output vector for the CLS symbol; this output vector is then fed to a multi-label classifier to obtain the prediction class label of the training sample.
And S33, classifying the output vector of the CLS symbol by using a multi-label classifier to obtain a prediction class label of the training sample.
In this embodiment, the multi-label classifier includes a plurality of class labels, which are obtained in step S2, for example, class label 1, class label 2, and so on. Classifying the output vector of the CLS symbol with the multi-label classifier yields a probability value for each class label; each class label whose probability value exceeds a set threshold (for example, 0.5) is taken as a prediction class label of the training sample. The threshold can be set flexibly according to actual requirements.
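As a sketch, the per-label thresholding step might look like this; the label names and the threshold value are illustrative only.

```python
import numpy as np

def predict_labels(logits, labels, threshold=0.5):
    """Apply an independent sigmoid to each label's logit and keep
    every class label whose probability exceeds the threshold."""
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    return [lab for lab, p in zip(labels, probs) if p > threshold]

labels = ["class label 1", "class label 2", "class label 3"]
print(predict_labels([2.0, -1.0, 0.3], labels))  # → ['class label 1', 'class label 3']
```

Because each label is thresholded independently, a sample can receive several prediction class labels, or none at all.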
And S34, training the text classification model based on the class label and the prediction class label of each training sample and a preset loss function to obtain the trained text classification model.
In this embodiment, the loss function is:

L_total(x_k, y_k) = [1 + γ·(1 − F1_body(x_k, y_k))]·L_DB(x_k, y_k)

where L_total represents the loss function, k indexes the training samples, x_k represents a training sample, and y_k represents the class label of the training sample (i.e., the class label obtained in step S2). γ represents the loss coefficient of the event subject: γ is 0 in the initial training stage, and once the text classification accuracy is stable (for example, the F1 value of text classification over the latest 5 epochs improves by no more than 0.01), γ is gradually increased, strengthening the influence of the event-subject accuracy on the loss; the model with the best final evaluation result is selected.
F1_body represents the accuracy of the event subject. When training the text classification model, the event-subject accuracy is calculated in order to increase the model's perception of the event subject, specifically:

F1_body(x_k, y_k) = (1/C)·Σ_{i=1}^{C} 2·TP_{k,i} / (2·TP_{k,i} + FP_{k,i} + FN_{k,i})

where C represents the total number of class labels and i indexes the class labels of the multi-label classifier; TP_{k,i}, FP_{k,i} and FN_{k,i} are the confusion-matrix counts for the event-subject classification result on the i-th class label of the k-th training sample. The four basic indexes TP, FP, FN and TN of the confusion matrix are as follows:
1. The true value is positive and the model predicts positive: True Positive (TP);
2. The true value is positive and the model predicts negative: False Negative (FN);
3. The true value is negative and the model predicts positive: False Positive (FP);
4. The true value is negative and the model predicts negative: True Negative (TN).
It should be noted that the true value refers to the actual class label of the training sample, while the model prediction refers to the predicted class label output by the text classification model. A positive true value means the training sample actually carries the class label (a positive sample); a negative true value means it does not (a negative sample). Likewise, a positive prediction means the model assigns the class label to the sample, and a negative prediction means it does not.
After the event-subject accuracy F1_body is calculated, the event-subject accuracy factor [1 + γ·(1 − F1_body(x_k, y_k))] is added to the loss function of the model. The purpose of this is to make the accuracy of the event subject part of the loss: since the training process reduces the overall loss, it will in turn increase the accuracy of the event subject.
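A minimal sketch of the total loss and the γ warm-up described above. The schedule parameters `step` and `tol` are assumptions; the embodiment only states that γ starts at 0 and grows once the classification F1 stabilizes.

```python
def total_loss(db_loss, f1_body, gamma):
    """L_total = [1 + gamma * (1 - F1_body)] * L_DB."""
    return (1.0 + gamma * (1.0 - f1_body)) * db_loss

def update_gamma(gamma, recent_f1, step=0.1, tol=0.01):
    """Increase gamma once the F1 of the last 5 epochs has improved
    by no more than `tol` (hypothetical increment `step`)."""
    if len(recent_f1) >= 5 and max(recent_f1[-5:]) - min(recent_f1[-5:]) <= tol:
        return gamma + step
    return gamma
```

With γ = 0 the loss reduces to the plain DB loss; as γ grows, samples whose event subject is poorly recognized contribute more to the loss.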
L_DB represents the classification loss function; to solve the class-imbalance and class co-occurrence problems of the class labels, the Distribution-Balanced loss (DB loss) is used. Class imbalance means, for example, that among 100,000 training samples there may be 10,000 samples of class label 1 but only 100 samples of class label 2; because the difference in sample counts is too large, the samples of class label 2 are easily ignored when training the model. Class co-occurrence means that, during training, samples of one class label are repeatedly brought along with samples of other class labels, which affects the accuracy of model training.
The DB loss function mainly comprises two parts. The first part is the re-balanced weight after resampling; following the Distribution-Balanced loss formulation, the smoothed weight can be written as:

r̂_{k,i} = α + sigmoid(β·(r_{k,i} − μ))

where r̂_{k,i} represents the weight of the i-th class label of the k-th training sample after smoothing, used to mitigate the class-imbalance and class co-occurrence problems of the class labels; r_{k,i} is the re-balancing ratio of the class-level to the instance-level sampling probability, and α, β and μ are smoothing hyperparameters. As before, x represents a training sample, y the class label of the training sample (i.e., the class label obtained in step S2), and z the predicted class label (logit) of the training sample.
The second part mitigates the over-suppression of negative samples in the multi-label classification problem through a negative-tolerant term:

y_{k,i}·log(1 + e^{−(z_{k,i} − v_i)}) + (1/λ)·(1 − y_{k,i})·log(1 + e^{λ·(z_{k,i} − v_i)})

where λ is a hyperparameter that affects the loss weight of the negative samples and v_i is the weight bias of the i-th class label.

Combining the two parts gives the DB loss function:

L_DB(x_k, z_k) = (1/C)·Σ_{i=1}^{C} r̂_{k,i}·[ y_{k,i}·log(1 + e^{−(z_{k,i} − v_i)}) + (1/λ)·(1 − y_{k,i})·log(1 + e^{λ·(z_{k,i} − v_i)}) ]
next, a process of establishing a keyword library is described, which may include the following steps:
and S10, performing word segmentation on the supervised corpus and removing stop words to obtain a word segmentation result of each training sample.
In this embodiment, the supervised corpus is segmented, stop words are removed, and the filtered words are counted, so as to obtain the segmentation result of each training sample.
And S20, based on the word segmentation result of each training sample, eliminating the high-frequency public words of the training sample corresponding to each class label.
In the present embodiment, for each category label, high-frequency words common to the event types under that category label, such as "leave job" and "resign", are removed.
And S30, screening out the special high-frequency words of the training sample corresponding to the event type aiming at each event type under any category label to obtain each key feature word of the event type.
In this embodiment, for each category label, the unique high-frequency words of each event type under the category label are screened out. For example, with reference to fig. 1, category label 1 includes "leader person resignation", "core high management resignation" and "important high management resignation"; for the event type "leader person resignation", the unique high-frequency words of its corresponding training samples, for example, president, CEO, president executive, highest executive, sponsor, employer, joint president, and the like, are screened out as the key feature words of that event type.
And S40, aiming at each key characteristic word of any event type under the category label, obtaining the weight of the key characteristic word according to the word frequency of the key characteristic word in the supervised corpus and the sample number of the training sample corresponding to the event type.
In this embodiment, after the key feature words of a certain event type are obtained in the manner of step S30, each key feature word is given a weight, calculated as:

w_i = n_i / n_median

where w_i represents the weight of key feature word i; n_i represents the ratio of the word frequency of key feature word i in the supervised corpus to the number of training samples corresponding to the event type in the supervised corpus; and n_median represents the median of the n_i values.
It should be noted that the above-mentioned process of calculating the weight of the key feature words is only an example, and in practice, the weight of each key word supports manual modification, and this is not limited in this embodiment of the present application.
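Under the stated definitions, the weighting step can be sketched as follows; `keyword_weights` and its inputs are illustrative names, and the median-normalised formula is reconstructed from the variable descriptions above.

```python
import statistics

def keyword_weights(freq, n_samples):
    """freq: word frequency of each key feature word in the supervised
    corpus; n_samples: number of training samples of the event type.
    w_i = n_i / n_median, with n_i = freq_i / n_samples."""
    n = {w: f / n_samples for w, f in freq.items()}
    n_median = statistics.median(n.values())
    return {w: v / n_median for w, v in n.items()}

w = keyword_weights({"CEO": 80, "president": 40, "chairman": 20}, n_samples=100)
print(w)  # → {'CEO': 2.0, 'president': 1.0, 'chairman': 0.5}
```

Normalising by the median keeps weights comparable across event types with very different corpus frequencies, before any manual adjustment.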
And S50, obtaining the weight of each key feature word of each event type under each category label to obtain a key feature word library.
According to the process of the steps S30 to S40, for each category label, each key feature word and the weight thereof of each event type under the category label are obtained, and therefore the establishment of the key feature word library is completed.
The training process of the text classification model and the establishment process of the key feature word bank are introduced above, and on this basis, detailed description is given to specific implementation of the embodiment of the present application.
Referring to fig. 3, fig. 3 is a schematic flowchart illustrating an event extraction method according to an embodiment of the present application. The event extraction method is applied to the electronic equipment and can comprise the following steps:
s101, obtaining the text to be processed, each event main body in the text to be processed and main body information of the event main body.
In this embodiment, the text to be processed may be text that needs event extraction, for example, news text on the Internet, including a title and a body. In practice, some news texts are relatively long, so before event extraction the news text can be summarized by an automatic summarization model to control the length of the whole text.
Alternatively, the process of obtaining the text to be processed and each event body in the text to be processed and the body information thereof in step S101 may include S1011 to S1013.
S1011, acquiring the original text.
And S1012, generating an abstract of the original text through the automatic abstract model to obtain the text to be processed.
And S1013, performing entity identification on the text to be processed through the entity identification model to obtain each event main body and main body information of the text to be processed.
In the present embodiment, the automatic summarization model may be any existing model that can automatically generate a summary, for example, Seq2Seq. The entity recognition model may be any existing model that can perform entity recognition, such as BiLSTM.
In this embodiment, the event subject may be a name of a person, a name of an organization, a name of a place, and other entities identified by names, such as a company. The main body information may include position information, weight, and the like of the event main body in the text to be processed, which is not limited in any way by the embodiment of the present application.
S102, performing event primary classification on a text to be processed by using a pre-trained text classification model to obtain a prediction category label and text heat information of the text to be processed, wherein the prediction category label is obtained by clustering at least one event type.
In this embodiment, the process of performing the primary event classification on the text to be processed by using the pre-trained text classification model in step S102 may include S1021 to S1024.
And S1021, inputting the text to be processed into the text classification model, and obtaining an embedding sequence of the text to be processed by using the Bert model, wherein the embedding sequence comprises word embedding for setting CLS symbols and word embedding for each word in the text to be processed.
S1022, learning semantic information of the text to be processed based on an attention mechanism by using a Bert model, and obtaining an attention matrix corresponding to the text to be processed and an output vector of a CLS symbol; wherein the attention matrix represents the similarity relation between the CLS symbol and each word in the text to be processed.
And S1023, classifying the output vector by using a multi-label classifier to obtain the probability value of each class label, and taking the class label with the probability value higher than a set threshold value as a prediction class label.
And S1024, performing linear transformation on the attention matrix by using the multi-label classifier to obtain text heat information, wherein the text heat information represents the relevance between a CLS symbol under the prediction class label and each word in the text to be processed.
In this embodiment, the Bert model uses the attention mechanism during text classification inference, and in this process the attention matrix corresponding to the text to be processed can be extracted as follows:

Attention = softmax(Q·K^T / √d_k)

The Attention matrix reflects the similarity relation between q and k at corresponding positions, i.e., the similarity relation between the CLS symbol and each word in the text to be processed. The Attention matrix is intermediate data of model inference; it is computed from the Query matrix Q ∈ ℝ^{l×h}, the Key matrix K ∈ ℝ^{l×h} and the constant d_k, giving Attention ∈ ℝ^{l×l}, where l is the length of the input sequence and h is the dimension of the hidden layer.
The Attention matrix is linearly transformed by the multi-label classifier to obtain the word heat under each category label, hot ∈ ℝ^{l×l×n}, where n is the number of category labels, as follows:

hot = sigmoid(Attention·W)

where the W matrix comes from the multi-label classifier.
in this embodiment, after the attention moment array is linearly transformed by using the multi-tag classifier to obtain the word heat under each class tag, hot information of the first CLS symbol is obtained, that is, matrix information hot [0,: i.e., a two-dimensional matrix of l × n is obtained from a hot three-dimensional matrix, where the first dimension is 0, and the correlation between the CLS symbol and each word in the text to be processed can be obtained.
Meanwhile, in step S1023 the multi-label classifier classifies the output vector of the CLS symbol to obtain the prediction category label of the text to be processed, and the multi-label classifier linearly transforms the Attention matrix to obtain the word heat under each category label; the text heat information of the prediction category label can therefore be read from the three-dimensional hot matrix.
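Extracting the CLS row from the three-dimensional hot matrix can be sketched as follows; the tensor shape (l, l, n) follows the description above, and the names are illustrative.

```python
import numpy as np

def cls_word_heat(hot, label_idx):
    """hot: (l, l, n) word-heat tensor, where row 0 corresponds to the
    CLS symbol; label_idx: index of the prediction category label.
    Returns the heat of each of the l words under that label."""
    return hot[0, :, label_idx]

hot = np.arange(18).reshape(3, 3, 2)   # toy tensor: l = 3 words, n = 2 labels
print(cls_word_heat(hot, 1))           # → [1 3 5]
```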
S103, obtaining a target event main body matched with the prediction type label according to the main body information and the text heat information of each event main body.
In this embodiment, a text classification model is used to perform event primary classification on a text to be processed, so as to obtain a prediction category tag and text heat information of the text to be processed, and then the prediction category tag is matched with an event main body, so as to obtain a target event main body matched with the prediction category tag.
Alternatively, the subject information may include location information, and the process of obtaining the target event subject matching the prediction category label according to the subject information and the text popularity information of each event subject in step S103 may include S1031 to S1033.
And S1031, according to the position information of each event main body, carrying out clause division on the text to be processed to obtain a text unit corresponding to each event main body.
In this embodiment, the text to be processed is divided into sentences according to the position information of the event body, and the text between the previous event body and the next event body is used as the text unit corresponding to the previous event body until the text unit corresponding to each event body is obtained.
S1032, calculating the text heat corresponding to each text unit according to the text heat information.
In this embodiment, after the text to be processed is divided into the text units corresponding to each event subject, the word popularity corresponding to each text unit is summed according to the text popularity information obtained in step S102, so as to obtain the text popularity corresponding to each text unit.
And S1033, taking the event main body corresponding to the text unit with the highest text popularity as the target event main body.
In this embodiment, after the text popularity corresponding to each text unit is calculated, the event subject corresponding to the text unit with the highest text popularity is taken as the target event subject. For example, assume the prediction category label is "financial loss or index variation" and there are two event subjects in the text to be processed, "Huaneng International" and "Yiwu Rural Commercial Bank"; if, for the "financial loss or index variation" event, the text popularity of "Huaneng International" is calculated to be higher than that of "Yiwu Rural Commercial Bank", then "Huaneng International" is matched with "financial loss or index variation".
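Steps S1031 to S1033 can be sketched as follows; subject positions are word indices, and the function name and data layout are illustrative.

```python
def match_event_subject(subjects, word_heat):
    """subjects: (name, start position) pairs sorted by position;
    word_heat: per-word heat under the prediction category label.
    Each subject's text unit runs from its position up to the next
    subject (the last one runs to the end of the text); the subject
    whose unit has the highest summed heat is the target subject."""
    best, best_heat = None, float("-inf")
    for i, (name, start) in enumerate(subjects):
        end = subjects[i + 1][1] if i + 1 < len(subjects) else len(word_heat)
        unit_heat = sum(word_heat[start:end])
        if unit_heat > best_heat:
            best, best_heat = name, unit_heat
    return best

subjects = [("Huaneng International", 0), ("Yiwu Rural Commercial Bank", 4)]
print(match_event_subject(subjects, [0.9, 0.9, 0.1, 0.1, 0.1, 0.1]))  # → Huaneng International
```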
And S104, performing secondary event classification on the text to be processed by utilizing the pre-established key feature word library, and reducing a target event type from the prediction category label to obtain an event label of the text to be processed, wherein the event label comprises a target event main body and a target event type.
In this embodiment, because similar event types are clustered in the model training process, the prediction class label obtained by model prediction includes at least one event type, and therefore the prediction class label needs to be split and restored to obtain the target event type.
In this embodiment, the text classification model includes a plurality of category labels, each category label being obtained by clustering at least one event type, and the key feature word library includes the key feature words and weights corresponding to each event type under each category label. For example, in combination with fig. 1, category label 1 is obtained by clustering three event types, namely "leader person resignation", "core high management resignation" and "important high management resignation", and the key feature word library includes the key feature words and weights corresponding to each of these three event types.
Therefore, the secondary event classification can be performed on the text to be processed in a key feature word scoring manner, and the target event type can be restored from the prediction category label, and the specific process can include S1041 to S1044.
S1041, performing word segmentation on the text to be processed to obtain a plurality of reference words.
S1042, aiming at each event type under the prediction category label, determining each target key feature word of the event type from a plurality of reference words based on the key feature word library.
And S1043, obtaining the weight of each target key feature word and summing the weights to obtain the weight of the event type.
And S1044, taking the event type with the highest weight as a target event type.
In this embodiment, when performing secondary event classification, the text to be processed is first segmented, and then, in combination with the key feature word library, it is determined whether each word is a key feature word of an event type under the prediction category label, so as to obtain a key feature word of each event type under the prediction category label. And then, summing the weights of the key characteristic words of each event type to serve as the weight of the event type, comparing the weights of the event types, and selecting the event type with the highest weight from the event types to serve as the final target event type.
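Steps S1041 to S1044 can be sketched as follows; the bank contents are made-up examples, and multi-word key feature words would need phrase matching rather than the token-set lookup used here.

```python
def restore_event_type(tokens, keyword_bank):
    """tokens: word-segmented text to be processed; keyword_bank:
    {event type: {key feature word: weight}} for the event types
    clustered under the prediction category label. Sums the weights
    of the key feature words present in the text and returns the
    event type with the highest total weight."""
    token_set = set(tokens)
    scores = {
        etype: sum(w for kw, w in kws.items() if kw in token_set)
        for etype, kws in keyword_bank.items()
    }
    return max(scores, key=scores.get)

bank = {
    "leader person resignation": {"chairman": 2.0, "CEO": 1.5},
    "core high management resignation": {"CFO": 1.8},
}
print(restore_event_type(["the", "chairman", "resigned"], bank))  # → leader person resignation
```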
Compared with the prior art, the embodiment of the application has the following beneficial effects:
firstly, the event extraction method provided by the embodiment of the application can obtain event types with an accuracy approximating that of text classification, without extracting trigger words.
Secondly, determining a hot spot area of the event information by adopting an Attention matrix based on an Attention mechanism, selecting an event main body close to the hot spot area as a final event main body, and completing the matching of the event type and the event main body.
Thirdly, aiming at a text classification scene with low event type discrimination, in order to improve the accuracy of text classification, firstly clustering and merging event types according to text word embedded information of labeled corpora, then carrying out primary event classification through a text classification model, and then carrying out secondary classification on similar events according to a key feature word grading mechanism on classification results, thereby better solving the problem of similar event classification, improving the classification error caused by insufficient perception of the model on local key information, and further improving the overall accuracy of event extraction.
In order to perform the corresponding steps in the above method embodiments and various possible embodiments, an implementation of the event extraction device is given below.
Referring to fig. 4, fig. 4 is a block diagram illustrating an event extraction apparatus 100 according to an embodiment of the present disclosure. The event extraction device 100 is applied to the electronic equipment 10 and comprises: the event classification method comprises an obtaining module 101, an event primary classification module 102, an event subject matching module 103 and an event secondary classification module 104.
An obtaining module 101, configured to obtain a to-be-processed text and each event subject in the to-be-processed text and subject information thereof.
The event primary classification module 102 is configured to perform event primary classification on a to-be-processed text by using a pre-trained text classification model to obtain a prediction category label and text heat information of the to-be-processed text, where the prediction category label is obtained by clustering at least one event type.
And the event main body matching module 103 is configured to obtain a target event main body matched with the prediction category tag according to the main body information and the text popularity information of each event main body.
And the event secondary classification module 104 is configured to perform event secondary classification on the text to be processed by using the pre-established key feature word bank, and restore the target event type from the prediction category label to obtain an event label of the text to be processed, where the event label includes a target event main body and a target event type.
Optionally, the obtaining module 101 is specifically configured to:
acquiring an original text;
generating an abstract of an original text through an automatic abstract model to obtain a text to be processed;
and carrying out entity recognition on the text to be processed through the entity recognition model to obtain each event main body and main body information of the text to be processed.
Optionally, the text classification model includes a Bert model and a multi-label classifier, the multi-label classifier including a plurality of category labels; the event primary classification module 102 is specifically configured to:
inputting a text to be processed into a text classification model, and obtaining an embedding sequence of the text to be processed by utilizing a Bert model, wherein the embedding sequence comprises word embedding of a set CLS symbol and word embedding of each word in the text to be processed;
learning semantic information of the text to be processed based on an attention mechanism by using a Bert model, and obtaining an attention matrix corresponding to the text to be processed and an output vector of a CLS symbol; wherein the attention matrix represents the similarity relation between the CLS symbol and each word in the text to be processed;
classifying the output vector by using a multi-label classifier to obtain the probability value of each class label, and taking the class label with the probability value higher than a set threshold value as a prediction class label;
and performing linear transformation on the attention matrix by using a multi-label classifier to obtain text heat information, wherein the text heat information represents the relevance between the CLS symbol under the prediction class label and each word in the text to be processed.
Optionally, the subject information includes location information, and the event subject matching module 103 is specifically configured to:
according to the position information of each event main body, carrying out sentence separation on the text to be processed to obtain a text unit corresponding to each event main body;
calculating the text heat corresponding to each text unit according to the text heat information;
and taking the event main body corresponding to the text unit with the highest text popularity as a target event main body.
Optionally, the text classification model includes a plurality of category labels, and each category label is obtained by clustering at least one event type; the key characteristic word bank comprises a plurality of key characteristic words corresponding to each event type and the weight of each key characteristic word; the event secondary classification module 104 is specifically configured to:
performing word segmentation on a text to be processed to obtain a plurality of reference words;
determining each target key feature word of the event type from a plurality of reference words based on a key feature word library aiming at each event type under the prediction category label;
obtaining the weight of each target key feature word and summing the weights to obtain the weight of the event type;
and taking the event type with the highest weight as a target event type.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the event extraction apparatus 100 described above may refer to the corresponding process in the foregoing method embodiments, and is not described herein again.
Referring to fig. 5, fig. 5 is a block diagram illustrating an electronic device 10 according to an embodiment of the present disclosure. The electronic device 10 includes a processor 11, a memory 12, and a bus 13, and the processor 11 is connected to the memory 12 through the bus 13.
The memory 12 is used for storing a program, and the processor 11 executes the program after receiving the execution instruction to implement the event extraction method disclosed in the above embodiment.
The Memory 12 may include a Random Access Memory (RAM) and may also include a non-volatile Memory (NVM).
The processor 11 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by instructions in the form of hardware integrated logic circuits or software in the processor 11. The processor 11 may be a general-purpose processor, and includes a Central Processing Unit (CPU), a Micro Control Unit (MCU), a Complex Programmable Logic Device (CPLD), a Field Programmable Gate Array (FPGA), and an embedded ARM.
The embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by the processor 11, the event extraction method disclosed in the foregoing embodiment is implemented.
To sum up, according to the event extraction method, the event extraction device, the electronic device, and the storage medium provided in the embodiments of the present application, when extracting an event from a text to be processed, a text classification model is first used to perform event primary classification on the text to be processed, so as to obtain a prediction category label and text heat information of the text to be processed; then, according to the subject information of each event subject in the text to be processed, and in combination with text heat information, finding out a target event subject matched with the prediction type label from all event subjects; then, because the prediction category label is obtained by clustering at least one event type, event secondary classification is carried out on the text to be processed by utilizing the key feature word bank, and a target event type is restored from the prediction category label, so that an event label of the text to be processed can be obtained; therefore, under the condition that trigger words do not need to be extracted, event extraction can be achieved in a text classification mode, and the event extraction efficiency is improved.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (13)

1. An event extraction method, the method comprising:
acquiring a text to be processed, and each event main body and main body information thereof in the text to be processed;
performing event primary classification on the text to be processed by using a pre-trained text classification model to obtain a prediction category label and text heat information of the text to be processed, wherein the prediction category label is obtained by clustering at least one event type;
obtaining a target event main body matched with the prediction type label according to the main body information of each event main body and the text heat information;
and performing event secondary classification on the text to be processed by utilizing a pre-established key feature word bank, and reducing a target event type from the prediction category label to obtain an event label of the text to be processed, wherein the event label comprises the target event main body and the target event type.
2. The method of claim 1, wherein the text classification model comprises a Bert model and a multi-label classifier, the multi-label classifier comprising a plurality of category labels;
the step of performing event primary classification on the text to be processed by using a pre-trained text classification model to obtain a prediction category label and text heat information of the text to be processed comprises the following steps:
inputting the text to be processed into the text classification model, and obtaining an embedding sequence of the text to be processed by using the Bert model, wherein the embedding sequence comprises word embedding of a set CLS symbol and word embedding of each word in the text to be processed;
learning semantic information of the text to be processed based on an attention mechanism by using the Bert model, and obtaining an attention matrix corresponding to the text to be processed and an output vector of the CLS symbol; wherein the attention matrix represents the similarity relation between the CLS symbol and each word in the text to be processed;
classifying the output vector by using the multi-label classifier to obtain a probability value of each category label, and taking each category label with a probability value higher than a set threshold as the prediction category label;
and performing linear transformation on the attention matrix by using the multi-label classifier to obtain the text heat information, wherein the text heat information represents the relevance, under the prediction category label, between the CLS symbol and each word in the text to be processed.
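The primary classification step above can be sketched as follows; the weight matrices `W_cls` and `W_heat` and the 0.5 threshold are illustrative assumptions, not parameters named in the claim:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def primary_classify(cls_vector, attn_row, W_cls, W_heat, threshold=0.5):
    """Multi-label primary classification over the Bert CLS output.

    cls_vector: the output vector of the CLS symbol; attn_row: attention
    weights from the CLS symbol to each word; W_cls / W_heat: hypothetical
    linear layers of the multi-label classifier. Returns the indices of the
    predicted category labels and the per-word text heat.
    """
    probs = sigmoid(W_cls @ cls_vector)        # one probability per category label
    predicted = [i for i, p in enumerate(probs) if p > threshold]
    heat = W_heat @ attn_row                   # linear transform of the attention row
    return predicted, heat
```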
3. The method of claim 1, wherein the subject information includes location information;
the step of obtaining a target event subject matched with the prediction category label according to the subject information of each event subject and the text heat information comprises:
performing sentence segmentation on the text to be processed according to the position information of each event subject to obtain a text unit corresponding to each event subject;
calculating the text heat of each text unit according to the text heat information;
and taking the event subject corresponding to the text unit with the highest text heat as the target event subject.
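A minimal sketch of claim 3's subject selection, assuming each subject's text unit is given as a token span and that a unit's heat is aggregated by summation (the claim itself does not fix the aggregation):

```python
def pick_target_subject(subjects, heat):
    """subjects: list of (name, start, end) token spans, one per event
    subject's text unit; heat: per-word text heat scores.
    Returns the subject whose text unit carries the highest total heat."""
    best_name, best_score = None, float("-inf")
    for name, start, end in subjects:
        score = sum(heat[start:end])   # text heat of this text unit
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```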
4. The method of claim 1, wherein the text classification model comprises a plurality of category labels, each category label being obtained by clustering at least one event type; the key feature word library comprises a plurality of key feature words corresponding to each event type and the weight of each key feature word;
the step of performing event secondary classification on the text to be processed by using a pre-established key feature word library and restoring a target event type from the prediction category label comprises the following steps:
performing word segmentation on the text to be processed to obtain a plurality of reference words;
for each event type under the prediction category label, determining each target key feature word of the event type from the plurality of reference words based on the key feature word library;
obtaining the weight of each target key feature word and summing the weights to obtain the weight of the event type;
and taking the event type with the highest weight as the target event type.
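Claim 4's secondary classification reduces to a weighted keyword lookup; the dictionary layout below ({event_type: {word: weight}}) is an assumed representation of the key feature word library:

```python
def secondary_classify(tokens, event_types, keyword_lib):
    """tokens: the word-segmented text (the reference words); event_types:
    the event types clustered under the prediction category label;
    keyword_lib: {event_type: {key feature word: weight}}.
    Sums the weights of each event type's key feature words found among the
    tokens and returns the highest-weighted type (the target event type)."""
    token_set = set(tokens)
    scores = {}
    for etype in event_types:
        words = keyword_lib.get(etype, {})
        scores[etype] = sum(w for word, w in words.items() if word in token_set)
    return max(scores, key=scores.get)
```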
5. The method of claim 1, wherein the step of acquiring the text to be processed, and each event subject and its subject information in the text to be processed, comprises:
acquiring an original text;
generating an abstract of the original text through an automatic abstract model to obtain the text to be processed;
and carrying out entity recognition on the text to be processed through an entity recognition model to obtain each event subject and its subject information in the text to be processed.
6. The method of claim 1, wherein the text classification model is trained by:
acquiring a supervised corpus, wherein the supervised corpus comprises a plurality of training samples and an event type of each training sample;
clustering all event types to obtain a plurality of category labels, wherein each category label comprises at least one event type;
and training the text classification model by using the training samples and the class labels to obtain the trained text classification model.
7. The method of claim 6, wherein the step of clustering all event types to obtain a plurality of category labels comprises:
converting each training sample into vectors through a pre-trained word embedding model to obtain the word embedding information of each training sample;
grouping the word embedding information of training samples with the same event type, and taking the mean value of all word embedding information in a group as the feature vector of that event type, so as to obtain the feature vector of each event type;
calculating the correlation of every two event types according to the feature vector of each event type;
and performing hierarchical clustering on all event types according to the correlation of every two event types to obtain the plurality of category labels.
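The clustering in claims 6 and 7 can be sketched as greedy agglomerative merging on cosine correlation; average linkage and stopping at a target cluster count are assumptions, since the claim only specifies hierarchical clustering by pairwise correlation:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def cluster_event_types(features, num_labels):
    """features: {event_type: feature vector (mean of its word embeddings)}.
    Repeatedly merges the two most correlated clusters until num_labels
    clusters remain; each final cluster becomes one category label."""
    clusters = [[t] for t in features]
    while len(clusters) > num_labels:
        best, pair = -2.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [cosine(features[a], features[b])
                        for a in clusters[i] for b in clusters[j]]
                score = sum(sims) / len(sims)   # average-linkage correlation
                if score > best:
                    best, pair = score, (i, j)
        i, j = pair
        clusters[i] += clusters.pop(j)          # merge the closest pair
    return clusters
```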
8. The method of claim 6, wherein the text classification model comprises a Bert model and a multi-label classifier, the multi-label classifier comprising the plurality of category labels;
the step of training the text classification model by using the training samples and the category labels to obtain a trained text classification model comprises:
inputting the training samples and the category labels into the text classification model, and obtaining a sample embedding sequence of each training sample by using the Bert model, wherein the sample embedding sequence comprises the word embedding of a preset CLS symbol and the word embedding of each word in the training sample;
learning semantic information of the sample embedding sequence by using the Bert model based on an attention mechanism to obtain an output vector of the CLS symbol;
classifying the output vector of the CLS symbol by using the multi-label classifier to obtain a prediction category label of the training sample;
and training the text classification model based on the category label and the prediction category label of each training sample and a preset loss function to obtain the trained text classification model.
9. The method of claim 8, wherein the loss function is:
L_total(x_k, y_k) = [1 + γ(1 − F1_body(x_k, y_k))] · L_DB(x_k, y_k)
wherein L_total represents the loss function, k indexes the training samples, x_k represents the k-th training sample, y_k represents its category label, γ represents the event-subject loss coefficient, F1_body represents the event-subject accuracy, and L_DB represents the classification loss function;
the event subject accuracy is:
F1_body(x_k, y_k) = (1/C) · Σ_{i=1..C} [ 2·TP_ki / (2·TP_ki + FP_ki + FN_ki) ]
wherein C represents the total number of category labels, i indexes the category labels of the multi-label classifier, and TP_ki, FP_ki and FN_ki represent the confusion-matrix indexes of the event-subject classification result for the i-th category label of the k-th training sample;
the classification loss function is:
L_DB(x_k, y_k) = (1/C) · Σ_{i=1..C} [ y_ki · r̂_ki · log(1 + e^(−(z_ki − v_i))) + (1 − y_ki) · (1/λ) · r̂_ki · log(1 + e^(λ(z_ki − v_i))) ]
wherein r̂_ki represents the smoothed weight of the i-th category label of the k-th training sample, z_ki represents the predicted output of the training sample for the i-th category label, λ is a hyperparameter influencing the loss weight of negative samples, and v_i is the weight bias of the i-th category label.
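A numeric sketch of the claim-9 loss under the symbol definitions above; since the patent's own formula images are not reproduced in this text, the macro-averaged F1 and the exact form of L_DB (the standard distribution-balanced loss) should be treated as reconstructions:

```python
import numpy as np

def f1_body(tp, fp, fn):
    """Event-subject accuracy: per-label F1 averaged over the C category labels."""
    tp, fp, fn = map(np.asarray, (tp, fp, fn))
    return float(np.mean(2 * tp / (2 * tp + fp + fn)))

def db_loss(y, z, r_hat, v, lam):
    """Distribution-balanced classification loss L_DB over C labels.
    y: 0/1 targets; z: predicted logits; r_hat: smoothed per-label weights;
    v: per-label weight bias; lam: negative-sample loss hyperparameter."""
    pos = y * r_hat * np.log1p(np.exp(-(z - v)))
    neg = (1 - y) * r_hat * (1.0 / lam) * np.log1p(np.exp(lam * (z - v)))
    return float(np.mean(pos + neg))

def total_loss(y, z, r_hat, v, lam, gamma, tp, fp, fn):
    """L_total = [1 + gamma * (1 - F1_body)] * L_DB, as in claim 9."""
    return (1 + gamma * (1 - f1_body(tp, fp, fn))) * db_loss(y, z, r_hat, v, lam)
```

Note that a perfect event-subject F1 reduces L_total to the plain classification loss, while subject errors scale it up by at most (1 + gamma).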
10. The method of claim 6, wherein the key feature word library is created by:
performing word segmentation on the supervised corpus and removing stop words to obtain a word segmentation result of each training sample;
removing, based on the word segmentation result of each training sample, the high-frequency common words of the training samples corresponding to each category label;
for each event type under any category label, screening out the distinctive high-frequency words of the training samples corresponding to the event type to obtain the key feature words of the event type;
for each key feature word of any event type under the category label, obtaining the weight of the key feature word according to the word frequency of the key feature word in the supervised corpus and the number of training samples corresponding to the event type;
and obtaining the weight of each key feature word of each event type under each category label, so as to obtain the key feature word library.
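The library construction in claim 10 can be sketched as follows; the cutoffs for "high-frequency" words (`top_common`, `top_special`) are assumptions, and each word's weight is taken as its term frequency divided by the number of samples of the event type, per the claim's description:

```python
from collections import Counter

def build_keyword_lib(corpus, top_common=2, top_special=3):
    """corpus: {event_type: list of word-segmented samples, stop words removed}.
    Removes high-frequency words common across types, keeps each type's own
    high-frequency words as its key feature words, and weights each word by
    its frequency over the number of samples of that type."""
    all_counts = Counter(w for samples in corpus.values()
                         for sample in samples for w in sample)
    common = {w for w, _ in all_counts.most_common(top_common)}
    lib = {}
    for etype, samples in corpus.items():
        counts = Counter(w for sample in samples for w in sample if w not in common)
        n = len(samples)
        lib[etype] = {w: c / n for w, c in counts.most_common(top_special)}
    return lib
```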
11. An event extraction device, the device comprising:
an acquisition module, configured to acquire a text to be processed, and each event subject and its subject information in the text to be processed;
an event primary classification module, configured to perform event primary classification on the text to be processed by using a pre-trained text classification model to obtain a prediction category label and text heat information of the text to be processed, wherein the prediction category label is obtained by clustering at least one event type;
an event subject matching module, configured to obtain a target event subject matched with the prediction category label according to the subject information of each event subject and the text heat information;
and an event secondary classification module, configured to perform event secondary classification on the text to be processed by using a pre-established key feature word library and restore a target event type from the prediction category label to obtain an event label of the text to be processed, wherein the event label comprises the target event subject and the target event type.
12. An electronic device, comprising a processor and a memory, the memory being configured to store a program, the processor being configured to implement the event extraction method of any one of claims 1-10 when executing the program.
13. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out an event extraction method according to any one of claims 1 to 10.
CN202211717646.2A 2022-12-29 2022-12-29 Event extraction method and device, electronic equipment and storage medium Pending CN115935983A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211717646.2A CN115935983A (en) 2022-12-29 2022-12-29 Event extraction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115935983A (en) 2023-04-07

Family

ID=86552265

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116340831A (en) * 2023-05-24 2023-06-27 京东科技信息技术有限公司 Information classification method and device, electronic equipment and storage medium
CN116340831B (en) * 2023-05-24 2024-02-06 京东科技信息技术有限公司 Information classification method and device, electronic equipment and storage medium
CN116501898A (en) * 2023-06-29 2023-07-28 之江实验室 Financial text event extraction method and device suitable for few samples and biased data
CN116501898B (en) * 2023-06-29 2023-09-01 之江实验室 Financial text event extraction method and device suitable for few samples and biased data


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination