WO2023035330A1 - Long text event extraction method and apparatus, and computer device and storage medium - Google Patents


Publication number
WO2023035330A1
Authority
WO
WIPO (PCT)
Application number
PCT/CN2021/120030
Other languages
French (fr)
Chinese (zh)
Inventor
谢翀
罗伟杰
陈永红
黄开梅
Original Assignee
深圳前海环融联易信息科技服务有限公司
Application filed by 深圳前海环融联易信息科技服务有限公司
Publication of WO2023035330A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data
    • G06F 16/35: Clustering; Classification
    • G06F 16/353: Clustering; Classification into predefined classes
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Abstract

Disclosed in the present application are a long text event extraction method and apparatus, and a computer device and a storage medium. The method comprises: acquiring a trigger word in long text of an event to be extracted, and performing text truncation on the long text according to the trigger word, so as to obtain truncated text; using a deep learning model to classify and predict a plurality of event types corresponding to the truncated text; in combination with machine reading comprehension technology and a pointer network model, extracting corresponding event role information for each event type; and on the basis of a sequence generation algorithm, combining all event role information into a target event, and outputting the target event as an event extraction result. In the present application, by means of performing event classification, event role extraction and event combination on long text, the event extraction efficiency and extraction accuracy of the long text are improved.

Description

Long text event extraction method and apparatus, computer device, and storage medium
This application is based on, and claims priority to, Chinese patent application No. 202111065602.1 filed on September 13, 2021, the entire content of which is incorporated herein by reference.
Technical Field
The present application relates to the field of computer technology, and in particular to a long text event extraction method and apparatus, a computer device, and a storage medium.
Background
At present, major news media, official accounts, bloggers, and the like generate a large amount of information every day, including but not limited to news reports, commentary and predictions, and analysis and interpretation. These texts are often long, with complex content and divergent viewpoints, and service companies frequently need to monitor such text information to obtain industry dynamics and event information in a timely manner. Traditional event extraction methods mainly rely on rules formulated by domain experts together with extensive manual screening and verification; this approach involves a heavy workload and yields low efficiency and accuracy. The present application is therefore based on deep learning technology, which enables fully automated event extraction, greatly improves efficiency, and surpasses manual verification in accuracy.
Existing event extraction methods for long text generally adopt a relatively simple definition of an event. For example, some financial public opinion analysis platforms mainly extract the principal event roles from financial texts, present them as keywords and the like, and additionally evaluate the sentiment of the entire text. Such platforms mainly apply simple event classification and NER (Named Entity Recognition) to extract events from long text. Event classification assigns classification labels to the original text, and a single text may carry multiple labels; named entity recognition identifies and extracts keyword information that may be present in the original text, such as companies and times.
A second, broadly similar approach is relation extraction for shorter texts, mainly article titles, abstracts, summaries, and the like, focusing on the subject, the object, and the relationship between them. This type of method mainly applies relation extraction technology and broadly has two implementations: the first uses named entity recognition to identify the subjects in the text and then jointly extracts the objects and their relationships with other models; the second uses named entity recognition to extract subjects and objects simultaneously and, when multiple subjects or objects exist, pairs and groups them with a binary classification model.
Regarding the first existing method mentioned above, the extracted event information is limited. For example, in a long text of the "company listing" type, existing methods mainly attend to the specific listed company and the time, while other important information such as "financing scale", "listed market value", and "financing rounds" is neither extracted nor displayed. In addition, existing methods only alert users at the level of sentiment classification, with no prompts concerning importance, timeliness, or authority.
Regarding the second method mentioned above, extracting only the subject, object, and their relationship is likewise rather simplistic. Moreover, the method's applicability is narrow: because only simple information is extracted, it is generally used only for information extraction from short texts, which greatly restricts its practical scope. Furthermore, relation extraction requires both a subject and an object to be present, whereas real-world text often lacks one of them. For "Company A goes public", for instance, there is only the subject "Company A" and no corresponding object, so the method cannot be applied. The second relation extraction method therefore has significant limitations.
Summary of the Application
Embodiments of the present application provide a long text event extraction method and apparatus, a computer device, and a storage medium, aiming to improve the efficiency and accuracy of event extraction for long text.
In a first aspect, an embodiment of the present application provides a long text event extraction method, including:
acquiring a trigger word in the long text of an event to be extracted, and performing text truncation on the long text according to the trigger word to obtain truncated text;
using a deep learning model to classify and predict a plurality of event types corresponding to the truncated text;
combining machine reading comprehension technology and a pointer network model to extract corresponding event role information for each of the event types; and
based on a sequence generation algorithm, combining all the event role information into one target event, and outputting the target event as an event extraction result.
In a second aspect, an embodiment of the present application provides a long text event extraction apparatus, including:
a first truncation unit, configured to acquire a trigger word in the long text of an event to be extracted, and perform text truncation on the long text according to the trigger word to obtain truncated text;
a first classification prediction unit, configured to use a deep learning model to classify and predict a plurality of event types corresponding to the truncated text;
a first extraction unit, configured to combine machine reading comprehension technology and a pointer network model to extract corresponding event role information for each of the event types; and
a result output unit, configured to combine all the event role information into one target event based on a sequence generation algorithm, and output the target event as an event extraction result.
In a third aspect, an embodiment of the present application provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the long text event extraction method described in the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the long text event extraction method described in the first aspect.
Embodiments of the present application provide a long text event extraction method and apparatus, a computer device, and a storage medium. The method includes: acquiring a trigger word in the long text of an event to be extracted, and performing text truncation on the long text according to the trigger word to obtain truncated text; using a deep learning model to classify and predict a plurality of event types corresponding to the truncated text; combining machine reading comprehension technology and a pointer network model to extract corresponding event role information for each event type; and, based on a sequence generation algorithm, combining all the event role information into one target event and outputting the target event as the event extraction result. By performing event classification, event role extraction, and event combination on long text, the embodiments of the present application improve the efficiency and accuracy of event extraction for long text.
Brief Description of the Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings required in the description of the embodiments are briefly introduced below. Evidently, the drawings described below show some embodiments of the present application, and those of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a long text event extraction method provided by an embodiment of the present application;
Fig. 2 is a schematic sub-flowchart of a long text event extraction method provided by an embodiment of the present application;
Fig. 3 is a schematic sub-flowchart of a long text event extraction method provided by an embodiment of the present application;
Fig. 4 is a schematic block diagram of a long text event extraction apparatus provided by an embodiment of the present application;
Fig. 5 is a schematic sub-block diagram of a long text event extraction apparatus provided by an embodiment of the present application;
Fig. 6 is a schematic sub-block diagram of a long text event extraction apparatus provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. Evidently, the described embodiments are some rather than all of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
It should be understood that, when used in this specification and the appended claims, the terms "comprise" and "include" indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or combinations thereof.
It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the present application. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" used in this specification and the appended claims refers to and encompasses any and all possible combinations of one or more of the associated listed items.
Referring to Fig. 1, Fig. 1 is a schematic flowchart of a long text event extraction method provided by an embodiment of the present application, which specifically includes steps S101 to S104.
S101. Acquire a trigger word in the long text of an event to be extracted, and perform text truncation on the long text according to the trigger word to obtain truncated text.
S102. Use a deep learning model to classify and predict a plurality of event types corresponding to the truncated text.
S103. Combine machine reading comprehension technology and a pointer network model to extract corresponding event role information for each event type.
S104. Based on a sequence generation algorithm, combine all the event role information into one target event, and output the target event as the event extraction result.
In this embodiment, the event extraction process is divided into three stages: event classification, event role extraction, and event combination. In the event classification stage, trigger words are first used to truncate the long text, and a deep learning model then classifies and predicts the truncated text. In the event role extraction stage, since the truncated texts and the event classification information of all truncated texts were obtained in the event classification stage, the event role information belonging to each event type must be extracted; this is done with an MRC (Machine Reading Comprehension) plus pointer network strategy. In the event combination stage, the model extraction of the first two stages has produced, for each truncated text, all the event roles under a given event type, so this stage combines all the event role information into one complete event (i.e., the target event) by generating a sequence, and outputs it externally.
By performing event classification, event role extraction, and event combination on long text, this embodiment improves the efficiency and accuracy of event extraction for long text. The long text in this embodiment may be academic papers, news reports, magazines and periodicals, and so on. For news reports, for example, the event extraction is more detailed, supports finer-grained queries, and reduces the time users spend reading the original text. An importance ranking of event roles is also provided, enabling users to selectively focus on key points. In addition, this embodiment adopts deep learning techniques, which greatly reduces the workload of later operations and review.
It should be noted that, in the event classification stage, although existing text truncation techniques such as random truncation and head-and-tail truncation exist, both incur information loss to varying degrees. For multi-label classification, although schemes such as multiple binary classifiers can be used, they may suffer from sample imbalance, and prediction performance is poor for texts containing few actual events.
In the event role extraction stage, the effectiveness of existing techniques under large data volumes and complex, variable event types remains unverified, whereas this embodiment has reached an end-to-end F1 of 0.7+. The evaluation metric is the end-to-end F1: starting from the initial text input, n events are output, each with m event roles. F1 is computed as 2 * (p * r) / (p + r), where p is the precision, i.e., the proportion of the m * n extracted event roles that are correct, and r is the recall, i.e., the number of correct event roles among the m * n extracted ones as a proportion of the total number of labels.
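The end-to-end F1 above can be sketched as a small scoring function. The set-of-triples representation of extracted roles is an assumption for illustration; the application does not prescribe a data structure.

```python
def full_pipeline_f1(predicted_roles, gold_roles):
    """End-to-end F1 over extracted event roles.

    Both arguments are sets of (event_type, role, value) triples; this
    representation is illustrative, not taken from the application.
    """
    correct = len(predicted_roles & gold_roles)
    p = correct / len(predicted_roles) if predicted_roles else 0.0  # precision
    r = correct / len(gold_roles) if gold_roles else 0.0            # recall
    return 2 * p * r / (p + r) if p + r else 0.0
```

With two of three predicted roles correct against three gold roles, both precision and recall are 2/3, giving F1 = 2/3.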
In the event combination stage, existing solutions rely on business staff continually updating a rule engine to perform pairing, which is inefficient, inaccurate, and costly; this embodiment overcomes these drawbacks.
In an embodiment, as shown in Fig. 2, step S101 includes steps S201 to S204.
S201. Select trigger words in the long text using a trigger word dictionary, and pre-truncate the long text using the trigger words.
S202. Based on the pre-truncated long text, count the number of sentences and the total number of characters between different trigger words.
S203. Construct discrete intervals from the total character counts between different trigger words, and select the interval accounting for the largest share of the distribution.
S204. Take the mode within the selected interval as the character count threshold, and truncate the long text using that threshold.
In this embodiment, the event classification stage faces two major pain points of news reports: excessive text length and diverse event types. For pain point 1 (excessive text length), a trigger word dictionary compiled by domain experts is provided first. A trigger word means that if the corresponding keyword appears in the text, there is some probability that an event of the corresponding type exists. This stage truncates the text around event trigger words: all trigger words present in the text are located, and the sentences within a certain character count threshold around each trigger word are cut out, the threshold being determined mainly by statistics. Because Chinese pre-trained models generally limit the maximum input text length to guarantee performance, the original text must be truncated. The specific process is as follows:
Statistics on the long text are gathered separately by event dimension. The long text is first split at periods, question marks, exclamation marks, and the like.
The number of sentences and the total number of characters between different trigger words are counted. For example, if the trigger word "上市" (listing) appears in a "company listing" event and the trigger word "退市" (delisting) appears in a subsequent "company delisting" event, this stage counts the characters between "上市" and "退市" as the following-context character count of the "上市" trigger word; the preceding-context count is handled in the same way.
After counting, the specific character counts are discretized into intervals, such as (under 50 characters), (50-100 characters), and so on; the distribution over the intervals is tallied, and within the interval accounting for the largest share of the distribution, the mode is chosen as the character count threshold before and after the trigger word for text segmentation.
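The threshold selection in steps S203 to S204 can be sketched as follows, assuming the character counts between trigger words have already been gathered; the 50-character bucket width mirrors the example intervals above and is not fixed by the application.

```python
from collections import Counter

def pick_char_threshold(char_counts, bucket_width=50):
    """Discretize the between-trigger character counts into intervals,
    keep the interval holding the largest share of the distribution,
    and return the mode of the counts inside it as the threshold."""
    intervals = Counter(c // bucket_width for c in char_counts)
    top_interval = intervals.most_common(1)[0][0]
    in_top = [c for c in char_counts if c // bucket_width == top_interval]
    return Counter(in_top).most_common(1)[0][0]
```

For counts [40, 60, 60, 70, 120], the 50-100 interval dominates and its mode, 60, becomes the truncation threshold.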
In an embodiment, as shown in Fig. 3, step S102 includes steps S301 to S304.
S301. Obtain a training set containing truncated training texts and event types, and splice the truncated training texts in the training set with the event labels.
S302. Perform convolution on the spliced truncated training texts with a deep learning model augmented with additional convolution kernels.
S303. Optimize and update the improved deep learning model using the focal loss function.
S304. Use the updated deep learning model to perform event classification prediction on the truncated text.
In this embodiment, to address pain point 2 of the event classification stage (diverse event types), the training and prediction structure of the deep learning model is modified and a multi-label classification technique is applied, ensuring that each truncated text can be predicted as multiple event types. The specific process is as follows:
In the training stage, this embodiment splices the truncated text with each event type, separated by special characters. For example, with 10 event types, a single original training text becomes 10 training texts, and the corresponding training label becomes a binary label: the model's training objective is reduced to judging whether the text belongs to the given event label, which effectively mitigates the problem of small sample sizes. At the model level, some changes were also made to accommodate this: the model no longer convolves the original text alone but convolves the original text with the event label spliced on. Since the spliced segments may be semantically distant from each other, this embodiment adds a small number of convolution kernels with stride 2 while retaining the original stride-1 kernels, improving the ability to extract information from distant text.
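The training-time expansion described here can be sketched as follows; the `[SEP]` separator is an assumed special character, since the application does not name the one it actually uses.

```python
def expand_training_sample(text, event_labels, gold_labels, sep="[SEP]"):
    """Turn one truncated text into one (text + label, 0/1) pair per event
    type, so the multi-label problem becomes binary classification on each
    spliced sample. `sep` stands in for the unspecified special character."""
    return [(f"{text}{sep}{label}", 1 if label in gold_labels else 0)
            for label in event_labels]
```

A text with 10 candidate event types thus yields 10 binary training samples, one per label.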
In addition, this embodiment also modifies the final loss computation. Because the original model handled multi-label text, the original loss computation is no longer suitable for the present binary classification model; moreover, to counter the large number of negative samples produced by the binary reformulation, this embodiment adopts the focal loss function, which effectively prevents an excess of negative samples from biasing the model toward fitting them.
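A sketch of focal loss in this binary setting; the application names the loss but not its hyperparameters, so the gamma = 2 and alpha = 0.25 defaults commonly used with focal loss are assumptions here.

```python
import numpy as np

def binary_focal_loss(probs, labels, gamma=2.0, alpha=0.25):
    """Binary focal loss: down-weights easy (mostly negative) samples so
    the abundant negatives produced by label splicing do not dominate.
    gamma and alpha are assumed defaults, not values from the application."""
    probs = np.clip(probs, 1e-7, 1 - 1e-7)
    pt = np.where(labels == 1, probs, 1 - probs)      # prob of the true class
    weight = np.where(labels == 1, alpha, 1 - alpha)  # class-balancing factor
    return float(np.mean(-weight * (1 - pt) ** gamma * np.log(pt)))
```

The (1 - pt)^gamma factor shrinks the contribution of confidently correct samples, which is what keeps the many easy negatives from dominating the gradient.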
In the prediction stage, all event types are likewise spliced after the original text. For example, a single prediction text is likewise expanded into 10 prediction texts; the model performs the same inference to obtain a binary result for each event type, and post-processing aggregates all event types predicted as 1 to obtain all the event types of the text. At the model level, the prediction stage mirrors the training stage: the feed-forward computation is identical and likewise includes a small number of stride-2 convolution kernels, chiefly to ensure that the parameters from the training stage are fully reproduced at prediction time. Also, outputting the prediction result does not require the focal loss computation; the activation output of the preceding layer is emitted directly.
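The prediction-time post-processing in this paragraph, sketched with a stand-in classifier; both `classify` and the `[SEP]` token are placeholders for the trained model and its unspecified separator.

```python
def predict_event_types(text, event_labels, classify, sep="[SEP]"):
    """Splice every candidate event type after the text, run the binary
    classifier on each spliced sample, and keep the labels predicted as 1.
    `classify` stands in for the trained model; `sep` is an assumed token."""
    return [label for label in event_labels
            if classify(f"{text}{sep}{label}") == 1]
```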
In an embodiment, step S103 includes:
using a question-answering architecture to splice a question after each event type of the truncated text;
using the pointer network model to construct label lists from the spliced question, and using the label lists to predict the probability values of the start position and end position, within the truncated text, of the answer to the question; and
selecting the start position and end position with the largest probability values, and taking the text content between the start position and the end position as the event role information under the corresponding event type.
In this embodiment, event role extraction also has several pain points, such as role labels being diverse, overlapping, or split, and some roles being unidentifiable under event constraints; traditional NER technology cannot resolve these. To address them, this embodiment adopts an MRC (Machine Reading Comprehension) plus pointer network strategy. The MRC component uses an overall question-answering architecture, i.e., a question is spliced after the input truncated text. This greatly enriches the truncated text and, with the question added, focuses the model on extracting the role information of the current event. For example, appending the question "在事件公司上市中，上市企业是什么？" (In the company-listing event, what is the listed company?) to the truncated text "A公司于今年10月上市。" (Company A went public in October this year.) forms a new truncated text "A公司于今年10月上市。在事件公司上市中，上市企业是什么？". From this input, the model can learn the co-occurrence relationship between "上市企业" (listed company) and Company A, which is very important for training.
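The question splicing in the example above can be written as a small helper; the Chinese question template follows the example sentence in this paragraph and is not a general template given by the application.

```python
def splice_question(truncated_text, event_type, role):
    """Append one MRC-style question per event role after the truncated
    text, mirroring the example in the description."""
    return f"{truncated_text}在事件{event_type}中，{role}是什么？"
```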
另外，还需要预测拼接的问句的答案在截断文本中的起始位置和终止位置。并且，针对每种事件类型下的每个事件角色都设置单独的问题，即如果一种事件类型下存在10个事件角色，则原始文本会被拼接10个问句组成10条训练样本进行训练。In addition, the model must predict the start and end positions, within the truncated text, of the answer to the appended question. A separate question is set for each event role under each event type; that is, if an event type has 10 event roles, the original text is concatenated with 10 questions to form 10 training samples.
事件角色识别（即事件角色信息获取）最重要的训练目标是获得该角色在截断文本中的起始位置和终止位置，但是如果起始位置和终止位置之间同时也存在其他的事件角色，例如“深圳华为科技公司”中的“深圳”即是公司名称，也是所在地区，传统的事件角色识别技术并不能很好的解决这个问题。而指针网络主要是通过两组标签值来分别拟合起始位置和终止位置，同时针对每个事件角色都有独立的两组标签列表进行隔离，模型需要单独给每个事件角色预测两组预测值，分别与两组标签列表计算损失，最终保证在每个事件角色下都能得到最优解。指针网络的输入仍然是MRC结构下的拼接有问句的截断文本。The most important training objective of event role recognition (i.e., acquiring event role information) is to obtain each role's start and end positions in the truncated text. However, other event roles may also lie within or overlap a role's span: for example, "深圳" (Shenzhen) in "深圳华为科技公司" (Shenzhen Huawei Technology Company) is both part of the company name and the company's region, a case that traditional event role recognition cannot handle well. The pointer network fits the start and end positions with two separate sets of label values, and each event role has its own independent pair of label lists for isolation: the model predicts two sets of values for each role separately, computes the loss against the corresponding pair of label lists, and thus guarantees an optimal solution for every event role. The input to the pointer network is still the question-augmented truncated text under the MRC structure.
例如，拼接有问句的截断文本的长度为100，则指针网络会构建两个长度为100的标签列表。第一标签列表主要负责预测事件角色的起始位置，每个位置都会输出是否为起始位置的概率值，找到概率值最大的位置作为事件角色的起始位置。具体过程可以有多种基本网络，在本实施例中可以采用transformer的编码器进行处理，transformer在NLP领域应用十分广泛，拥有强大的特征变换及处理能力，能够很好抽取输入文本的表层句法结构信息和深层语义信息。整体过程类似于指针在长度为100的文本上前后移动，直到找到起始位置。第二个标签列表与第一个标签列表处理过程的原理相同，只是将拟合目标（即起始位置）变换为事件角色的终止位置。For example, if the question-augmented truncated text has a length of 100, the pointer network constructs two label lists of length 100. The first list predicts the event role's start position: every position outputs a probability of being the start, and the position with the highest probability is taken as the start. Various base networks can implement this; this embodiment uses a transformer encoder, which is widely applied in NLP, has powerful feature-transformation and processing capabilities, and can extract both the surface syntactic structure and the deep semantics of the input text. The overall process resembles a pointer moving back and forth over the length-100 text until the start position is found. The second label list works on the same principle as the first, except that the fitting target is changed from the start position to the event role's end position.
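A toy version of the two-label-list mechanism can be written as below. The probability values are hand-picked for illustration; in the embodiment they would come from a transformer encoder over the question-augmented text:

```python
# Two label lists, each as long as the input text: one scores every
# position as a candidate start, the other as a candidate end. The argmax
# of each list is taken and the span between them is the role text.

def extract_span(text, start_probs, end_probs):
    assert len(start_probs) == len(end_probs) == len(text)
    start = max(range(len(text)), key=start_probs.__getitem__)
    end = max(range(len(text)), key=end_probs.__getitem__)
    return text[start:end + 1]

text = "A公司于今年10月上市。"                 # 12 characters
start_probs = [0.8] + [0.02] * 11             # peak at index 0
end_probs = [0.05, 0.10, 0.70] + [0.01] * 9   # peak at index 2
print(extract_span(text, start_probs, end_probs))  # A公司
```

With an independent pair of lists per event role, overlapping spans such as a region name inside a company name can each be recovered without label conflicts.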
针对同一个实体有多个事件角色标签，同一个实体前半部分和后半部分属于不同类型的标签等问题，本实施例采用指针网络，将多标签识别的问题转化成大量单标签的二分类问题，避免信息混杂。针对事件约束下的部分角色不能被识别的问题，本实施例采用MRC技术，MRC技术主要是将原始文本进行转化，将原始文本拼接问题文本一起送入预训练的语言模型中。模型需要预测问题文本的答案的所在位置，其中的问题文本与事件类型强相关，因此能够实现事件类型对于事件角色的强约束，保证每个事件下的事件角色信息都符合领域专家制定的规则。For the problems that one entity can carry multiple event role labels, or that the first and second halves of one entity belong to different label types, this embodiment uses the pointer network to convert multi-label recognition into many single-label binary classification problems, avoiding information confusion. For the problem that some roles cannot be identified under event constraints, this embodiment adopts MRC, which transforms the original text by concatenating it with a question text and feeding both into a pre-trained language model. The model must predict the location of the answer to the question text; because the question text is strongly related to the event type, the event type strongly constrains the event roles, ensuring that the role information under each event conforms to the rules formulated by domain experts.
在一实施例中,所述序列生成算法为DOC2EDAG算法。In one embodiment, the sequence generation algorithm is DOC2EDAG algorithm.
本实施例中，EDAG全称为Entity-based Directed Acyclic Graph，意为基于实体的有向无环图，即将长文本中抽取得到的一系列事件角色构建成一个有向无环图，也就是生成一个由事件角色组成的序列作为单一事件。In this embodiment, EDAG stands for Entity-based Directed Acyclic Graph: the series of event roles extracted from the long text is built into a directed acyclic graph, i.e., a sequence composed of event roles is generated as a single event.
在一实施例中,所述步骤S104包括:In one embodiment, the step S104 includes:
基于所述事件角色信息对每一事件类型下属的所有事件角色进行排序;sorting all event roles under each event type based on the event role information;
通过一状态变量对每一事件类型下属的事件角色进行状态更新;Update the state of the event role under each event type through a state variable;
根据排序结果和状态更新结果,通过DOC2EDAG算法为所有的事件角色构建有向无环图,得到所有的所述事件角色信息组合的序列,并将所述序列作为所述目标事件输出。According to the sorting result and the state update result, construct a directed acyclic graph for all event roles through the DOC2EDAG algorithm, obtain a sequence of all event role information combinations, and output the sequence as the target event.
本实施例中，在事件组合阶段的痛点在于任何事件的任何事件角色都有可能是一个实体，多个实体，甚至没有实体，因此在配对组合上会面对极其复杂的逻辑处理。目前该痛点在工业界主要通过规则处理，学术界存在一定的模型实现。而本实施例则基于DOC2EDAG算法，将事件组合转化成序列生成的任务。具体的，对于每种事件类型，为下属的所有事件角色定义一个顺序，并逐步更新每个事件角色。定义顺序的标准可以由领域知识专家确定，标准为单一事件维度下的角色重要性排序。如“公司上市”事件中的角色重要性为：上市公司，上市环节，上市证券所，上市时间等等。In this embodiment, the pain point of the event combination stage is that any event role of any event may correspond to one entity, multiple entities, or even no entity, so pairing and combination face extremely complex logic. In industry this pain point is currently handled mainly by rules, while some model-based implementations exist in academia. This embodiment, based on the DOC2EDAG algorithm, converts event combination into a sequence generation task. Specifically, for each event type, an order is defined over all of its subordinate event roles, and each role is updated step by step. The criterion for the order can be determined by domain experts, namely role importance under the single-event dimension; for example, the role importance in a "company listing" event is: listed company, listing stage, listing stock exchange, listing time, and so on.
同时，通过所述状态变量m，记录每一事件类型更新到某个事件角色时整个事件的状态，在扩展下一个事件角色节点时，会根据此时的状态变量m和新加入事件角色节点的特征e进行综合判断。Meanwhile, the state variable m records the state of the whole event as each event type is updated to a given event role; when expanding the next event role node, a comprehensive judgment is made from the current state variable m and the feature e of the newly added event role node.
然后根据排序结果和状态更新结果,对事件角色信息生成序列组合,并以此作为事件抽取结果输出。Then, according to the sorting result and the state update result, a sequence combination is generated for the event role information, which is output as the event extraction result.
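The sequence-generation view of event combination can be sketched as below. The role order and the candidate lists are invented for illustration, and the learned pruning via the state variable is omitted; every candidate is kept:

```python
# Roles are expanded in a fixed, expert-defined order; each partial event
# is extended with every accepted candidate (or None when the role is
# absent), so events sharing a prefix grow from a common branch, mirroring
# the DAG construction. The real model prunes extensions with the learned
# state variable; here every candidate is accepted.

ROLE_ORDER = ["上市公司", "上市证券所", "上市时间"]  # assumed order

def combine_events(candidates_per_role):
    events = [{}]
    for role in ROLE_ORDER:
        extended = []
        for partial in events:
            for value in candidates_per_role.get(role) or [None]:
                new_event = dict(partial)
                new_event[role] = value
                extended.append(new_event)
        events = extended
    return events

events = combine_events({"上市公司": ["A公司", "B公司"], "上市时间": ["10月"]})
print(len(events))  # 2: one event per listed company
```

Merging the shared prefix (here the common listing time and the absent exchange) is what turns this enumeration into the directed acyclic graph described above.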
在一实施例中,所述通过一状态变量对每一事件类型下属的事件角色进行状态更新,包括:In an embodiment, updating the state of the event role subordinate to each event type through a state variable includes:
获取至少一新增事件角色节点,并利用全连接层对每一所述事件角色节点进行特征变换;Obtaining at least one newly added event role node, and performing feature transformation on each of the event role nodes by using a fully connected layer;
将特征变换结果与所述状态变量进行拼接,并将拼接结果依次输入值全连接层和激活函数,得到每一所述事件角色节点与对应事件角色的匹配概率值;Splicing the feature transformation result with the state variable, and inputting the splicing result into a fully connected layer and an activation function in turn to obtain a matching probability value between each event role node and the corresponding event role;
选择匹配概率值最大的事件角色节点作为对应事件角色的预测结果,并更新对应的事件类型。Select the event role node with the largest matching probability value as the prediction result of the corresponding event role, and update the corresponding event type.
本实施例中，综合判断主要由神经网络的全连接层决定，主要流程是新加入的事件角色节点的节点特征e经过全连接层进行特征变换，再与此时的状态变量进行拼接，然后经过一层全连接层和激活函数，得到该事件角色节点与该事件角色匹配的概率值。选取匹配概率值最高的事件角色节点作为该事件角色的预测结果。In this embodiment, the comprehensive judgment is made mainly by fully connected layers of the neural network: the node feature e of the newly added event role node is transformed by a fully connected layer, concatenated with the current state variable, and then passed through another fully connected layer and an activation function to obtain the probability that the node matches the event role. The node with the highest matching probability is selected as the prediction result for that role.
每个事件角色节点可能是真实的实体,也可能是空值,最终把公共前缀进行合并,形成每个单独的事件。Each event role node may be a real entity, or it may be a null value, and finally the common prefixes are merged to form each individual event.
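A hand-rolled numeric sketch of that matching score follows. The tiny weight matrices are fixed examples, not trained parameters, and sigmoid stands in for the unspecified activation function:

```python
import math

# Node feature e is linearly transformed, the result is concatenated with
# the state variable m, and a second linear layer plus a sigmoid yields
# the probability that the candidate node fills the current event role.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def match_probability(e, m, W1, W2, b2):
    h = matvec(W1, e)                                # feature transform of e
    z = h + m                                        # list concat = splicing with m
    logit = sum(w * x for w, x in zip(W2, z)) + b2   # second fully connected layer
    return sigmoid(logit)

W1 = [[1.0, 0.0], [0.0, 1.0]]   # identity transform, for illustration only
W2 = [1.0, 1.0, 1.0, 1.0]
p = match_probability([1.0, 0.0], [0.0, 1.0], W1, W2, 0.0)
print(round(p, 4))  # 0.8808  (= sigmoid(2.0))
```

In the embodiment this score is computed for every candidate node and the highest-scoring node is taken as the role's prediction.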
还需注意的是，由于事件抽取的整体流程过长，因此需要进行流程拆解后利用不同模型的组合来分而治之。在不同的阶段也存在不同的痛点，而本实施例则可以解决存在的痛点。各个阶段之间的联系主要通过串联实现，以输入一条长文本为例，一阶段（即事件分类阶段）主要输出该长文本的所有截断文本的事件类型（多分类）；二阶段（即事件角色抽取阶段）输入这些截断文本，主要输出每条截断文本的每个事件类型下识别得到的所有事件角色；三阶段（即事件组合阶段）输入所有事件角色，通过序列生成模型获得包含一批事件角色的所有事件，最终实现事件抽取的需求。It should also be noted that because the overall event extraction flow is long, it is decomposed and handled by a combination of different models in a divide-and-conquer manner. Different stages have different pain points, and this embodiment resolves each of them. The stages are connected in series. Taking one long text as input: the first stage (event classification) outputs the event types (multi-class) of all truncated texts of the long text; the second stage (event role extraction) takes these truncated texts as input and outputs all event roles identified under each event type of each truncated text; the third stage (event combination) takes all event roles as input and, through the sequence generation model, obtains all events composed of these roles, finally fulfilling the event extraction requirement.
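The three-stage serial flow just described can be outlined schematically. Each stage is stubbed as a plain callable standing in for the classification model, the MRC + pointer-network extractor, and the sequence-generation model; all names and stub behaviors are illustrative:

```python
# Stage 1 classifies each truncated text into event types; stage 2
# extracts event roles per (chunk, event type); stage 3 combines the
# roles into complete events. The stubs below only show the data flow.

def extract_events(long_text, truncate, classify, extract_roles, combine):
    events = []
    for chunk in truncate(long_text):
        for event_type in classify(chunk):              # stage 1
            roles = extract_roles(chunk, event_type)    # stage 2
            events.extend(combine(event_type, roles))   # stage 3
    return events

events = extract_events(
    "A公司于今年10月上市。",
    truncate=lambda text: [text],
    classify=lambda chunk: ["公司上市"],
    extract_roles=lambda chunk, et: {"上市企业": "A公司"},
    combine=lambda et, roles: [{"type": et, **roles}],
)
print(events)  # [{'type': '公司上市', '上市企业': 'A公司'}]
```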
图4为本申请实施例提供的一种长文本事件抽取装置400的示意性框图,该装置400包括:FIG. 4 is a schematic block diagram of a long text event extraction device 400 provided in an embodiment of the present application. The device 400 includes:
第一截断单元401,用于获取待抽取事件的长文本中的触发词,并根据所述触发词对长文本进行文本截断,得到截断文本;The first truncation unit 401 is configured to obtain trigger words in the long text of the event to be extracted, and perform text truncation on the long text according to the trigger words to obtain the truncated text;
第一分类预测单元402,用于利用深度学习模型分类预测所述截断文本对应的多个事件类型;The first classification prediction unit 402 is configured to use a deep learning model to classify and predict multiple event types corresponding to the truncated text;
第一抽取单元403,用于结合机器阅读理解技术和指针网络模型,对每一所述事件类型抽取对应的事件角色信息;The first extraction unit 403 is used to combine machine reading comprehension technology and pointer network model to extract corresponding event role information for each event type;
结果输出单元404,用于基于序列生成算法,将所有的所述事件角色信息组合为一目标事件,并将所述目标事件作为事件抽取结果输出。The result output unit 404 is configured to combine all the event role information into a target event based on a sequence generation algorithm, and output the target event as an event extraction result.
在一实施例中,如图5所示,所述第一截断单元401包括:In an embodiment, as shown in FIG. 5, the first truncation unit 401 includes:
触发词选取单元501,用于通过触发词词典在长文本中选取触发词,并利用触发词对长文本进行预截断;The trigger word selection unit 501 is used to select the trigger word in the long text through the trigger word dictionary, and utilize the trigger word to pre-truncate the long text;
统计单元502,用于基于预截断的长文本,统计不同触发词之间的句子数量和总字数;A statistical unit 502, configured to count the number of sentences and the total number of words between different trigger words based on the pre-truncated long text;
区间选取单元503,用于根据不同触发词之间的总字数构建离散区间,并基于所述离散区间选取分布占比最多的字数区间;The interval selection unit 503 is used to construct a discrete interval according to the total number of words between different trigger words, and select the interval with the largest number of words distributed based on the discrete interval;
字数阈值设置单元504,用于在所述字数区间中选取众数作为字数阈值,并利用所述字数阈值对长文本进行文本截断。The word count threshold setting unit 504 is configured to select the mode number in the word count interval as the word count threshold, and use the word count threshold to perform text truncation on long texts.
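The threshold-selection logic of units 502 to 504 can be sketched as follows. The bucket width of 100 characters is an assumed value for illustration; the patent does not fix the interval size:

```python
from collections import Counter

# Gap lengths (total character counts between adjacent trigger words) are
# bucketed into discrete intervals; the most populated interval is chosen
# and the mode of the lengths inside it becomes the truncation threshold.

def pick_length_threshold(gap_lengths, bucket_width=100):
    buckets = Counter(n // bucket_width for n in gap_lengths)
    best_bucket, _ = buckets.most_common(1)[0]
    in_best = [n for n in gap_lengths if n // bucket_width == best_bucket]
    mode, _ = Counter(in_best).most_common(1)[0]
    return mode

print(pick_length_threshold([120, 130, 130, 450, 980]))  # 130
```

Long texts would then be cut at trigger words so that each chunk stays near this threshold.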
在一实施例中，如图6所示，所述第一分类预测单元402包括：In one embodiment, as shown in FIG. 6, the first classification prediction unit 402 includes:
标签拼接单元601,用于获取包含截断训练文本和事件类型的训练集,并对训练集中的截断训练文本按照事件标签拼接;A label splicing unit 601, configured to obtain a training set comprising truncated training text and event types, and stitch the truncated training text in the training set according to the event label;
卷积处理单元602，用于通过增加卷积核的深度学习模型对拼接后的截断训练文本进行卷积处理；The convolution processing unit 602 is configured to perform convolution processing on the concatenated truncated training text using a deep learning model with added convolution kernels;
优化更新单元603,用于采用focal-loss损失函数对改进的深度学习模型进行优化更新;An optimization update unit 603, configured to optimize and update the improved deep learning model using a focal-loss loss function;
第二分类预测单元604，用于利用更新后的深度学习模型对截断文本进行事件分类预测。The second classification prediction unit 604 is configured to use the updated deep learning model to perform event classification prediction on the truncated text.
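As a reference for the optimization-update unit, a minimal single-label focal-loss sketch is given below, following the standard formulation FL(p_t) = -alpha * (1 - p_t)**gamma * log(p_t). The alpha and gamma values are the commonly used defaults, not values taken from the patent:

```python
import math

# Focal loss down-weights easy, well-classified examples so training
# focuses on hard ones, which helps when event-type labels are imbalanced.

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """p: predicted probability of the positive class; y: gold label 0 or 1."""
    p_t = p if y == 1 else 1.0 - p
    a_t = alpha if y == 1 else 1.0 - alpha
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confident correct prediction contributes far less loss than a hard one:
print(focal_loss(0.9, 1) < focal_loss(0.6, 1))  # True
```

In the multi-label classifier this would be summed over the per-event-type sigmoid outputs.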
在一实施例中,所述第一抽取单元403包括:In an embodiment, the first extraction unit 403 includes:
问句拼接单元,用于采用问答式架构在所述截断文本的每一事件类型后拼接问句;A question splicing unit for splicing questions after each event type of the truncated text using a question-and-answer structure;
概率预测单元,用于通过指针网络模型,根据拼接问句构建标签列表,并利用所述标签列表预测所述问句在所述截断文本中的起始位置概率值和终止位置概率值;The probability prediction unit is used to construct a label list according to the concatenated questions through the pointer network model, and use the label list to predict the starting position probability value and the ending position probability value of the question sentence in the truncated text;
位置选取单元,用于选取概率值最大的起始位置和终止位置,并将所述起始位置和终止位置之间的文本内容作为对应事件类型下属的事件角色信息。The position selection unit is configured to select the start position and the end position with the highest probability value, and use the text content between the start position and the end position as the event role information under the corresponding event type.
在一实施例中,所述序列生成算法为DOC2EDAG算法。In one embodiment, the sequence generation algorithm is DOC2EDAG algorithm.
在一实施例中,所述结果输出单元404包括:In one embodiment, the result output unit 404 includes:
角色排序单元,用于基于所述事件角色信息对每一事件类型下属的所有事件角色进行排序;a role sorting unit, configured to sort all event roles under each event type based on the event role information;
状态更新单元,用于通过一状态变量对每一事件类型下属的事件角色进行状态更新;A state update unit, configured to update the state of the event roles under each event type through a state variable;
序列输出单元,用于根据排序结果和状态更新结果,通过DOC2EDAG算法为所有的事件角色构建有向无环图,得到所有的所述事件角色信息组合的序列,并将所述序列作为所述目标事件输出。The sequence output unit is used to construct a directed acyclic graph for all event roles through the DOC2EDAG algorithm according to the sorting result and the state update result, to obtain a sequence of all the event role information combinations, and use the sequence as the target event output.
在一实施例中,所述状态更新单元包括:In one embodiment, the status update unit includes:
特征变换单元,用于获取至少一新增事件角色节点,并利用全连接层对每一所述事件角色节点进行特征变换;A feature transformation unit, configured to obtain at least one newly added event role node, and perform feature transformation on each of the event role nodes by using a fully connected layer;
特征拼接单元,用于将特征变换结果与所述状态变量进行拼接,并将拼接结果依次输入值全连接层和激活函数,得到每一所述事件角色节点与对应事件角色的匹配概率值;The feature splicing unit is used to splice the feature transformation result and the state variable, and input the splicing result into the fully connected layer and the activation function in turn to obtain the matching probability value between each of the event role nodes and the corresponding event role;
节点选择单元,用于选择匹配概率值最大的事件角色节点作为对应事件角色的预测结果,并更新对应的事件类型。The node selection unit is configured to select the event role node with the highest matching probability value as the prediction result of the corresponding event role, and update the corresponding event type.
由于装置部分的实施例与方法部分的实施例相互对应,因此装置部分的实施例请参见方法部分的实施例的描述,这里暂不赘述。Since the embodiment of the device part corresponds to the embodiment of the method part, please refer to the description of the embodiment of the method part for the embodiment of the device part, and details will not be repeated here.
本申请实施例还提供了一种计算机可读存储介质,其上存有计算机程序,该计算机程序被执行时可以实现上述实施例所提供的步骤。该存储介质可以包括:U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。The embodiment of the present application also provides a computer-readable storage medium, on which a computer program is stored. When the computer program is executed, the steps provided in the above-mentioned embodiments can be realized. The storage medium may include: U disk, mobile hard disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), magnetic disk or optical disk, and other media capable of storing program codes.
本申请实施例还提供了一种计算机设备,可以包括存储器和处理器,存储器中存有计算机程序,处理器调用存储器中的计算机程序时,可以实现上述实施例所提供的步骤。当然计算机设备还可以包括各种网络接口,电源等组件。The embodiment of the present application also provides a computer device, which may include a memory and a processor. A computer program is stored in the memory. When the processor invokes the computer program in the memory, the steps provided in the above embodiments can be implemented. Of course, the computer equipment may also include components such as various network interfaces and power supplies.
说明书中各个实施例采用递进的方式描述,每个实施例重点说明的都是与其他实施例的不同之处,各个实施例之间相同相似部分互相参见即可。对于实施例公开的系统而言,由于其与实施例公开的方法相对应,所以描述的比较简单,相关之处参见方法部分说明即可。应当指出,对于本技术领域的普通技术人员来说,在不脱离本申请原理的前提下,还可以对本申请进行若干改进和修饰,这些改进和修饰也落入本申请权利要求的保护范围内。Each embodiment in the description is described in a progressive manner, each embodiment focuses on the difference from other embodiments, and the same and similar parts of each embodiment can be referred to each other. As for the system disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and for the related information, please refer to the description of the method part. It should be pointed out that those skilled in the art can make some improvements and modifications to the application without departing from the principles of the application, and these improvements and modifications also fall within the protection scope of the claims of the application.
还需要说明的是，在本说明书中，诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来，而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且，术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的状况下，由语句“包括一个……”限定的要素，并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。It should also be noted that, in this specification, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device comprising said element.

Claims (10)

  1. 一种长文本事件抽取方法,其特征在于,包括:A method for extracting long text events, comprising:
    获取待抽取事件的长文本中的触发词,并根据所述触发词对长文本进行文本截断,得到截断文本;Acquiring the trigger word in the long text of the event to be extracted, and performing text truncation on the long text according to the trigger word to obtain the truncated text;
    利用深度学习模型分类预测所述截断文本对应的多个事件类型;Using a deep learning model to classify and predict multiple event types corresponding to the truncated text;
    结合机器阅读理解技术和指针网络模型,对每一所述事件类型抽取对应的事件角色信息;Combining machine reading comprehension technology and pointer network model to extract corresponding event role information for each event type;
    基于序列生成算法,将所有的所述事件角色信息组合为一目标事件,并将所述目标事件作为事件抽取结果输出。Based on a sequence generation algorithm, all the event role information is combined into a target event, and the target event is output as an event extraction result.
  2. 根据权利要求1所述的长文本事件抽取方法,其特征在于,所述获取待抽取事件的长文本中的触发词,并根据所述触发词对长文本进行文本截断,得到截断文本,包括:The long text event extraction method according to claim 1, wherein said acquiring the trigger word in the long text of the event to be extracted, and performing text truncation on the long text according to the trigger word, to obtain the truncated text includes:
    通过触发词词典在长文本中选取触发词,并利用触发词对长文本进行预截断;Select trigger words in the long text through the trigger word dictionary, and use the trigger words to pre-truncate the long text;
    基于预截断的长文本,统计不同触发词之间的句子数量和总字数;Based on the pre-truncated long text, count the number of sentences and the total number of words between different trigger words;
    根据不同触发词之间的总字数构建离散区间,并基于所述离散区间选取分布占比最多的字数区间;Construct discrete intervals according to the total word count between different trigger words, and select the word count interval with the largest distribution ratio based on the discrete intervals;
    在所述字数区间中选取众数作为字数阈值,并利用所述字数阈值对长文本进行文本截断。Select the mode number in the word count interval as the word count threshold, and use the word count threshold to truncate the long text.
  3. 根据权利要求1所述的长文本事件抽取方法,其特征在于,所述利用深度学习模型分类预测所述截断文本对应的多个事件类型,包括:The long text event extraction method according to claim 1, wherein said utilizing a deep learning model to classify and predict multiple event types corresponding to said truncated text comprises:
    获取包含截断训练文本和事件类型的训练集,并对训练集中的截断训练文本按照事件标签拼接;Obtain a training set containing truncated training texts and event types, and stitch the truncated training texts in the training set according to event labels;
    通过增加卷积核的深度学习模型对拼接后的截断训练文本进行卷积处理；performing convolution processing on the concatenated truncated training text using a deep learning model with added convolution kernels;
    采用focal-loss损失函数对改进的深度学习模型进行优化更新;Optimize and update the improved deep learning model by using the focal-loss loss function;
    利用更新后的深度学习模型对截断文本进行事件分类预测。using the updated deep learning model to perform event classification prediction on the truncated text.
  4. 根据权利要求1所述的长文本事件抽取方法,其特征在于,所述结合机器阅读理解技术和指针网络模型,对每一所述事件类型抽取对应的事件角色信息,包括:The long text event extraction method according to claim 1, characterized in that, the combination of machine reading comprehension technology and pointer network model extracts corresponding event role information for each of the event types, including:
    采用问答式架构在所述截断文本的每一事件类型后拼接问句;Using a question-and-answer structure to splice question sentences after each event type of the truncated text;
    通过指针网络模型,根据拼接问句构建标签列表,并利用所述标签列表预测所述问句在所述截断文本中的起始位置概率值和终止位置概率值;Constructing a label list according to the concatenated questions through the pointer network model, and using the label list to predict the starting position probability value and the ending position probability value of the question sentence in the truncated text;
    选取概率值最大的起始位置和终止位置,并将所述起始位置和终止位置之间的文本内容作为对应事件类型下属的事件角色信息。Select the start position and the end position with the highest probability value, and use the text content between the start position and the end position as the event role information of the corresponding event type.
  5. 根据权利要求1所述的长文本事件抽取方法,其特征在于,所述序列生成算法为DOC2EDAG算法。The long text event extraction method according to claim 1, wherein the sequence generation algorithm is a DOC2EDAG algorithm.
  6. 根据权利要求5所述的长文本事件抽取方法，其特征在于，所述基于序列生成算法，将所有的所述事件角色信息组合为一目标事件，并将所述目标事件作为事件抽取结果输出，包括：The long text event extraction method according to claim 5, wherein said combining all the event role information into a target event based on the sequence generation algorithm, and outputting the target event as an event extraction result, comprises:
    基于所述事件角色信息对每一事件类型下属的所有事件角色进行排序;sorting all event roles under each event type based on the event role information;
    通过一状态变量对每一事件类型下属的事件角色进行状态更新;Update the state of the event role under each event type through a state variable;
    根据排序结果和状态更新结果,通过DOC2EDAG算法为所有的事件角色构建有向无环图,得到所有的所述事件角色信息组合的序列,并将所述序列作为所述目标事件输出。According to the sorting result and the state update result, construct a directed acyclic graph for all event roles through the DOC2EDAG algorithm, obtain a sequence of all event role information combinations, and output the sequence as the target event.
  7. 根据权利要求6所述的长文本事件抽取方法,其特征在于,所述通过一状态变量对每一事件类型下属的事件角色进行状态更新,包括:The method for extracting long text events according to claim 6, wherein said updating the state of the event roles subordinate to each event type through a state variable includes:
    获取至少一新增事件角色节点,并利用全连接层对每一所述事件角色节点进行特征变换;Obtaining at least one newly added event role node, and performing feature transformation on each of the event role nodes by using a fully connected layer;
    将特征变换结果与所述状态变量进行拼接,并将拼接结果依次输入值全连接层和激活函数,得到每一所述事件角色节点与对应事件角色的匹配概率值;Splicing the feature transformation result with the state variable, and inputting the splicing result into a fully connected layer and an activation function in turn to obtain a matching probability value between each event role node and the corresponding event role;
    选择匹配概率值最大的事件角色节点作为对应事件角色的预测结果,并更新对应的事件类型。Select the event role node with the largest matching probability value as the prediction result of the corresponding event role, and update the corresponding event type.
  8. 一种长文本事件抽取装置,其特征在于,包括:A long text event extraction device is characterized in that it comprises:
    第一截断单元,用于获取待抽取事件的长文本中的触发词,并根据所述触发词对长文本进行文本截断,得到截断文本;The first truncation unit is configured to obtain trigger words in the long text of the event to be extracted, and perform text truncation on the long text according to the trigger words to obtain the truncated text;
    第一分类预测单元,用于利用深度学习模型分类预测所述截断文本对应的多个事件类型;A first classification prediction unit, configured to use a deep learning model to classify and predict multiple event types corresponding to the truncated text;
    第一抽取单元,用于结合机器阅读理解技术和指针网络模型,对每一所述事件类型抽取对应的事件角色信息;The first extraction unit is used to combine machine reading comprehension technology and pointer network model to extract corresponding event role information for each event type;
    结果输出单元,用于基于序列生成算法,将所有的所述事件角色信息组合为一目标事件,并将所述目标事件作为事件抽取结果输出。The result output unit is configured to combine all the event role information into a target event based on a sequence generation algorithm, and output the target event as an event extraction result.
  9. 一种计算机设备，其特征在于，包括存储器、处理器及存储在所述存储器上并可在所述处理器上运行的计算机程序，所述处理器执行所述计算机程序时实现如权利要求1至7任一项所述的长文本事件抽取方法。A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the long text event extraction method according to any one of claims 1 to 7.
  10. 一种计算机可读存储介质，其特征在于，所述计算机可读存储介质上存储有计算机程序，所述计算机程序被处理器执行时实现如权利要求1至7任一项所述的长文本事件抽取方法。A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the long text event extraction method according to any one of claims 1 to 7.
PCT/CN2021/120030 2021-09-13 2021-09-24 Long text event extraction method and apparatus, and computer device and storage medium WO2023035330A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111065602.1A CN113535963B (en) 2021-09-13 2021-09-13 Long text event extraction method and device, computer equipment and storage medium
CN202111065602.1 2021-09-13

Publications (1)

Publication Number Publication Date
WO2023035330A1 true WO2023035330A1 (en) 2023-03-16

Family

ID=78093162

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/120030 WO2023035330A1 (en) 2021-09-13 2021-09-24 Long text event extraction method and apparatus, and computer device and storage medium

Country Status (2)

Country Link
CN (1) CN113535963B (en)
WO (1) WO2023035330A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116501898A (en) * 2023-06-29 2023-07-28 之江实验室 Financial text event extraction method and device suitable for few samples and biased data
CN116776886A (en) * 2023-08-15 2023-09-19 浙江同信企业征信服务有限公司 Information extraction method, device, equipment and storage medium
CN117648397A (en) * 2023-11-07 2024-03-05 中译语通科技股份有限公司 Chapter event extraction method, system, equipment and storage medium

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115292568B (en) * 2022-03-02 2023-11-17 内蒙古工业大学 Civil news event extraction method based on joint model
CN114996434B (en) * 2022-08-08 2022-11-08 深圳前海环融联易信息科技服务有限公司 Information extraction method and device, storage medium and computer equipment
CN115982339A (en) * 2023-03-15 2023-04-18 上海蜜度信息技术有限公司 Method, system, medium and electronic device for extracting emergency

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009205372A (en) * 2008-02-27 2009-09-10 Mitsubishi Electric Corp Information processor, information processing method and program
US20200226218A1 (en) * 2019-01-14 2020-07-16 International Business Machines Corporation Automatic classification of adverse event text fragments
CN111522915A (en) * 2020-04-20 2020-08-11 北大方正集团有限公司 Extraction method, device and equipment of Chinese event and storage medium
CN112861527A (en) * 2021-03-17 2021-05-28 合肥讯飞数码科技有限公司 Event extraction method, device, equipment and storage medium
CN112905868A (en) * 2021-03-22 2021-06-04 京东方科技集团股份有限公司 Event extraction method, device, equipment and storage medium
CN113312916A (en) * 2021-05-28 2021-08-27 北京航空航天大学 Financial text event extraction method and device based on triggered word morphological learning

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004006133A1 (en) * 2002-07-03 2004-01-15 Iotapi., Com, Inc. Text-machine code, system and method
CN110210027B (en) * 2019-05-30 2023-01-24 杭州远传新业科技股份有限公司 Fine-grained emotion analysis method, device, equipment and medium based on ensemble learning
CN111090763B (en) * 2019-11-22 2024-04-05 北京视觉大象科技有限公司 Picture automatic labeling method and device


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116501898A (en) * 2023-06-29 2023-07-28 之江实验室 Financial text event extraction method and device suitable for few samples and biased data
CN116501898B (en) * 2023-06-29 2023-09-01 之江实验室 Financial text event extraction method and device suitable for few samples and biased data
CN116776886A (en) * 2023-08-15 2023-09-19 浙江同信企业征信服务有限公司 Information extraction method, device, equipment and storage medium
CN116776886B (en) * 2023-08-15 2023-12-05 浙江同信企业征信服务有限公司 Information extraction method, device, equipment and storage medium
CN117648397A (en) * 2023-11-07 2024-03-05 中译语通科技股份有限公司 Chapter event extraction method, system, equipment and storage medium

Also Published As

Publication number Publication date
CN113535963A (en) 2021-10-22
CN113535963B (en) 2021-12-21

Similar Documents

Publication Publication Date Title
Huq et al. Sentiment analysis on Twitter data using KNN and SVM
WO2023035330A1 (en) Long text event extraction method and apparatus, and computer device and storage medium
US20210382878A1 (en) Systems and methods for generating a contextually and conversationally correct response to a query
Kaur Incorporating sentimental analysis into development of a hybrid classification model: A comprehensive study
CN101140588A (en) Method and apparatus for ordering incidence relation search result
Sun et al. Pre-processing online financial text for sentiment classification: A natural language processing approach
CN112214614A (en) Method and system for mining risk propagation path based on knowledge graph
CN112036842A (en) Intelligent matching platform for scientific and technological services
Shekhawat Sentiment classification of current public opinion on brexit: Naïve Bayes classifier model vs Python’s Textblob approach
CN110826315B (en) Method for identifying timeliness of short text by using neural network system
Kim et al. Business environmental analysis for textual data using data mining and sentence-level classification
Zhao RETRACTED ARTICLE: Application of deep learning algorithm in college English teaching process evaluation
CN116628173B (en) Intelligent customer service information generation system and method based on keyword extraction
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
CN111859955A (en) Public opinion data analysis model based on deep learning
CN113379432B (en) Sales system customer matching method based on machine learning
Khekare et al. Design of Automatic Key Finder for Search Engine Optimization in Internet of Everything
Zhu et al. Sentiment analysis methods: Survey and evaluation
Rizinski et al. Sentiment Analysis in Finance: From Transformers Back to eXplainable Lexicons (XLex)
CN113536772A (en) Text processing method, device, equipment and storage medium
CN111967251A (en) Intelligent customer sound insight system
Bharadi Sentiment Analysis of Twitter Data Using Named Entity Recognition
CN117852553B (en) Language processing system for extracting component transaction scene information based on chat record
CN117708308B (en) RAG natural language intelligent knowledge base management method and system
CN117453895B (en) Intelligent customer service response method, device, equipment and readable storage medium

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 21956506

Country of ref document: EP

Kind code of ref document: A1