CN116991983A

CN116991983A - Event extraction method and system for company information text

Info

Publication number: CN116991983A
Application number: CN202311259460.1A
Authority: CN
Inventors: 李栓; 王笑; 朱健平; 那崇宁
Original assignee: Zhejiang Lab
Current assignee: Zhejiang Lab
Priority date: 2023-09-27
Filing date: 2023-09-27
Publication date: 2023-11-03
Anticipated expiration: 2043-09-27
Also published as: CN116991983B

Abstract

The application discloses an event extraction method and system for company information text. In the event extraction task for company information text, noise company names interfere with model performance. To address this, a new labeling rule is provided: noise company names are incorporated into the labeling system for entity recognition, and event categories corresponding to noise company names are set. The noisy problem of extracting company name fields while also judging whether each company name has a corresponding event type is thereby converted into a simple classification problem, which greatly relieves the pressure on the model and reduces the difficulty of the task. A two-stage extraction model of company names and event types is also constructed, which improves the accuracy with which the model extracts company name fields and judges the event types corresponding to the company names.

Description

Event extraction method and system for company information text

Technical Field

The application relates to the intersection of natural language processing and finance, and in particular to an event extraction method and an event extraction system for company information text.

Background

The task of event extraction for company information text is: extracting, from a given information text, what happens (the event type) to which company (the event subject). However, noise company names are often present in the given text, that is, company name fields that are merely mentioned or appear in the text but to which nothing happens. In the labeling systems commonly used for this task, such company names are not labeled, and the model structures designed for this task are often affected by them.

At present, the model structures for this task are mainly divided into two types:

1. Two-stage extraction, which first extracts the company name fields in a text and then judges what happens to each company name in the text. When extracting company name fields in this mode, the model must both accurately extract the company name fields in the text and judge whether each company name field, in its context, has an event type set in the labeling system. The accuracy of model recognition and extraction is therefore low, and satisfactory application performance cannot be achieved, particularly when samples are few.

2. Joint extraction, which extracts the company name fields in a text and judges the event types of the company names in the given text at the same time. To a certain extent, the model's judgment of event types gives information to the company name extraction task, which helps the model judge whether a company name field to be extracted has a set event type in the given text. However, this model structure does not solve the problem of company name noise at its source, and the noise still interferes heavily with the model.

Therefore, there is a need to solve the technical problem of how to mitigate the interference with model performance caused by noise company name fields, that is, company name fields for which no set event type occurs in the given text.

Disclosure of Invention

Aiming at the defects of the prior art, the application aims to provide an event extraction method and an event extraction system for company information texts.

The technical scheme adopted for solving the technical problems is as follows:

An event extraction method for company information text comprises the following steps:

(1) Acquiring information texts facing companies, and constructing a corpus of the information texts; cleaning and preprocessing information text in a corpus;

(2) Labeling the cleaned information text according to a preset rule; performing text vectorization and label digitization on the marked information text;

(3) Constructing a two-stage event extraction model of company names and event types, training, and extracting the company names and the corresponding event types by using the trained model;

(4) Finally screening and outputting the extracted company name and the corresponding event type;

Specifically, the cleaning and preprocessing of the information text in the corpus in step (1) is: sequentially unifying the case of English letters, unifying Chinese and English punctuation marks, converting traditional Chinese into simplified Chinese, and deleting garbled and non-printable characters.

Further, the step (2) of labeling the cleaned information text according to a preset rule includes the following sub-steps:

(2.1) labeling the fields of all company names and their abbreviations in the information text, [com₁, com₂, com₃, …];

(2.2) according to the preset event types [EventType₁, EventType₂, EventType₃, …, EventType_n, None, Out], labeling all event types that occur for each company name field in the given information text, where [EventType₁, EventType₂, EventType₃, …, EventType_n] represents the event types to be extracted, n is the total number of event types to be extracted, None indicates that nothing happens to the company name field in the given information text, and Out indicates that an event of a type other than the event types to be extracted happens to the company name field in the given information text.

Further, labeling the fields of all company names and their abbreviations in the information text in step (2.1) is specifically:

(2.1.1) acquiring an open-source dataset with strong company name annotations, namely the CLUENER fine-grained named entity recognition dataset, and separately screening out the samples in the dataset that contain company name labels; strong annotation means that the accuracy of the annotations on the samples is more than 98%;

(2.1.2) constructing and training a BERT+Softmax company name entity extraction model, and automatically labeling the information text using the trained company name entity extraction model;

(2.1.3) obtaining an open-source company name word list, and continuing to label company names in the information text using a forward matching algorithm and the open-source company name word list;

(2.1.4) finally, performing manual verification: checking and correcting wrongly labeled company name fields, and supplementing the labels of company name fields that were missed.

Further, the text vectorization and label digitization of the labeled information text in step (2) is specifically: vectorizing the input information text to obtain its vectorized representation; encoding the positions of the company names in the information text using the BIO encoding rule to obtain the company name labels; masking the position of each company name in the labels within the information text using the number 1, to generate a mask vector of each company name relative to the information text; and generating the event category label corresponding to each event subject, where k indicates that k company name fields exist in the information text, each with its own corresponding mask vector and event category label.

Further, constructing and training the two-stage event extraction model of company names and event types in step (3) is specifically: inputting the vectorized representation of the information text into a pre-trained model BERT₁ to obtain a semantic representation of the information text; sequentially inputting the semantic representation of the text into a single-layer linear function Linear and a normalized exponential function Softmax to obtain the predicted probability of whether each character in the information text belongs to a company name field; calculating the loss value of fitting the company name fields using a cross entropy function, and performing back propagation and parameter optimization to obtain a trained company name prediction model; inputting the vectorized representation of the information text into a pre-trained model BERT₂ to obtain a semantic representation of the information text; traversing the mask vectors of each information text, using the mask vector to select the representation vectors corresponding to company name j in text i, and sequentially inputting them into a pooling function Avgpool, a single-layer linear function Linear and a logistic regression function Sigmoid to obtain the probability distribution of different events occurring for company name j in text i; and calculating the loss value of predicting the event types using a binary cross entropy loss function Loss, and performing back propagation and model parameter optimization to obtain a trained event type prediction model.
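For reference, the two training objectives described in this step can be written out as follows. This is a sketch in assumed notation, none of which appears in the application: x is the vectorized text of length T, H and H' are the outputs of BERT₁ and BERT₂ with hidden size d, W₁, b₁, W₂, b₂ are the parameters of the two single-layer linear functions, C is the number of tag classes, y is the one-hot tag label, m_j is the mask vector of company name j, and lab_j is its event category label over the n + 2 categories. Averaging the losses over characters and categories is also an assumption; the application does not state the reduction.

```latex
\begin{aligned}
H &= \mathrm{BERT}_1(x) \in \mathbb{R}^{T \times d}, &
P &= \mathrm{Softmax}(H W_1 + b_1) \in \mathbb{R}^{T \times C}, \\
\mathcal{L}_1 &= -\frac{1}{T} \sum_{t=1}^{T} \sum_{c=1}^{C} y_{t,c} \log P_{t,c}, \\
H' &= \mathrm{BERT}_2(x) \in \mathbb{R}^{T \times d}, &
v_j &= \frac{\sum_{t=1}^{T} m_{j,t}\, H'_t}{\sum_{t=1}^{T} m_{j,t}}, \\
p_j &= \mathrm{Sigmoid}(W_2 v_j + b_2) \in (0,1)^{n+2}, \\
\mathcal{L}_2 &= -\frac{1}{n+2} \sum_{c=1}^{n+2}
  \bigl[ lab_{j,c} \log p_{j,c} + (1 - lab_{j,c}) \log (1 - p_{j,c}) \bigr].
\end{aligned}
```

Because the second stage uses an independent Sigmoid per category rather than a Softmax, one company name can carry several event types at once.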

Further, extracting the company names and the corresponding event types using the trained model in step (3) is specifically: using the constructed and trained two-stage event extraction model of company names and event types, obtaining the probability of whether each character in the input information text belongs to a company name field, and extracting the company name fields in the input information text according to these probabilities; masking the position of each company name in the information text using the number 1, to generate a mask vector of each company name relative to the information text; using the trained two-stage event extraction model, obtaining the probability distribution of different events occurring for each company name field in the information text, and extracting the event types occurring for each company name field in the input information text according to the probability distribution; if the event types extracted for a company name field are empty, that is, the probability of every category in the probability distribution is less than 0.5, selecting the event type with the largest probability value as the predicted event type.

Further, screening and outputting the company names and event types extracted by the model in step (4) is specifically: judging whether the event types corresponding to a company name contain Out or None, and if so, deleting those event types; if the event types corresponding to the company name are then not empty, outputting the company name and its corresponding event types; if they are empty, deleting the company name and its corresponding event types.

Another aspect of the application provides an event extraction system for company information text, comprising: a text database module, a text preprocessing module, a text labeling module, a text modeling module and an output module;

the text database module: used for acquiring and storing information texts about companies;

the text preprocessing module: used for cleaning and preprocessing the information texts in the corpus;

the text labeling module: used for labeling the cleaned information texts according to a preset rule;

the text modeling module: used for text vectorization and label digitization, and for constructing and training a joint extraction model of company names and event types;

the output module: used for outputting the company names and event types extracted by the model.

A terminal device comprises a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the event extraction method for company information text when executing the computer program.

The beneficial effects of the application are as follows:

1. In the event extraction method for company information text provided by the application, a new labeling rule is proposed for the problem of noise company names interfering with the model: noise company names are brought into the labeling rule for entity recognition and given corresponding labels. The noisy company name extraction problem, in which the type of a company name and its boundaries must be judged at the same time, is thereby converted into a simple classification problem, which greatly relieves the pressure on the model, reduces the difficulty of the task, and improves the precision of recognition and extraction.

2. In the event extraction method for company information text provided by the application, a three-stage labeling method is proposed, which sequentially performs automatic labeling by a deep learning model, automatic labeling by an external word list, and a manual labeling and error correction flow. Machine learning is fully used in the labeling task, which reduces the workload and pressure of the annotators and improves labeling accuracy.

3. In the event extraction method for company information text provided by the application, a two-stage event extraction model of company names and event types is adopted for the proposed labeling rule, which improves the accuracy of event extraction from company information text.

Drawings

FIG. 1 shows the event extraction method for company information text;

FIG. 2 is a flow chart of labeling information text in the event extraction method for company information text;

FIG. 3 shows the model structure and training flow in the event extraction method for company information text;

FIG. 4 is a flow chart of the event extraction system for company information text;

FIG. 5 is a schematic diagram of an electronic device according to the present application.

Detailed Description

The application is further described below with reference to examples. The following examples are presented only to aid in the understanding of the application. It should be noted that it will be apparent to those skilled in the art that various modifications and adaptations of the application can be made without departing from the principles of the application and these modifications and adaptations are intended to be within the scope of the application as defined in the following claims.

The application is further illustrated with reference to the following drawings:

The first aspect of the application:

Referring to FIG. 1, an event extraction method for company information text includes the following steps:

step S1: acquiring information texts facing companies, and constructing a corpus of the information texts;

step S2: cleaning and preprocessing information text in a corpus;

step S3: labeling the cleaned information text according to a preset rule;

step S4: performing text vectorization and label digitization on the marked information text;

step S5: constructing a two-stage event extraction model of company names and event types, training, and extracting the company names and the corresponding event types by using the trained model;

step S6: screening and outputting the company names and event types extracted by the model.

Further, step S2 mainly includes: sequentially unifying the case of English letters, unifying Chinese and English punctuation marks, converting traditional Chinese into simplified Chinese, and deleting garbled and non-printable characters;
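The cleaning operations of step S2 can be sketched as follows. This is a minimal illustration, not the application's implementation: the punctuation table `PUNCT_MAP` is a small hypothetical subset, mapping full-width marks to ASCII is an assumed direction of unification, lowercasing is an assumed choice of case, and traditional-to-simplified conversion is left to an injected `to_simplified` callable because the application does not name a conversion tool.

```python
from typing import Callable, Optional

# Hypothetical subset of full-width punctuation mapped to ASCII equivalents.
PUNCT_MAP = str.maketrans({
    "，": ",", "。": ".", "：": ":", "；": ";", "！": "!", "？": "?",
    "（": "(", "）": ")", "“": '"', "”": '"', "‘": "'", "’": "'",
})


def clean_text(text: str,
               to_simplified: Optional[Callable[[str], str]] = None) -> str:
    """Apply the four cleaning operations in the order given in step S2."""
    # 1. Unify the case of English letters (lowercase is an assumption).
    text = text.lower()
    # 2. Unify Chinese and English punctuation marks.
    text = text.translate(PUNCT_MAP)
    # 3. Convert traditional Chinese to simplified Chinese, if a converter is given.
    if to_simplified is not None:
        text = to_simplified(text)
    # 4. Delete garbled characters (U+FFFD) and non-printable characters.
    text = text.replace("\ufffd", "")
    return "".join(ch for ch in text if ch.isprintable())


print(clean_text("ABC公司：今日\u200b快讯\x00。"))
```

Note that `str.isprintable` also rejects line breaks and non-ASCII separators, so multi-line texts would need to be cleaned line by line under this sketch.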

Further, step S3 includes the following steps:

step S31, labeling the fields of all company names and their abbreviations in the information text, [com₁, com₂, com₃, …]; according to the preset event types [EventType₁, EventType₂, EventType₃, …, EventType_n, None, Out], labeling all event types that occur for each company name field in the given information text, where [EventType₁, EventType₂, EventType₃, …, EventType_n] represents the event types to be extracted, n is the total number of event types to be extracted, None indicates that nothing happens to the company name field in the given information text, and Out indicates that an event of a type other than the event types to be extracted happens to the company name field in the given information text;

step S32, taking as an example the text "Company A today's newsletter: a certain person steps down as president of a certain department and leaves company B; a certain shareholder plans to reduce holdings by no more than 6% of the shares.", the labeled company name fields are "company A", "department", "company B", "a plurality of" and "a rich stock", and the corresponding event types are None, executive change, Out, shareholder holding reduction and Out, respectively;

Further, referring to FIG. 2, labeling the fields [com₁, com₂, com₃, …] of all company names and their abbreviations in the information text in step S31 specifically includes the following steps:

step S311, acquiring an open-source dataset with strong company name annotations, constructing and training a BERT+Softmax company name entity extraction model, and automatically labeling the information text using the trained company name entity extraction model;

step S312, obtaining an open-source company name word list, and continuing to label company names in the information text using a forward matching algorithm and the open-source company name word list;

step S313, finally, performing manual verification and correcting wrongly labeled company name fields;
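The word-list labeling of step S312 and its combination with the model labels of step S311 can be sketched as follows. The application only says "a forward matching algorithm"; forward maximum matching (longest entry first, scanning left to right) is assumed here. The function names, the span format `(start, end, name)` with an exclusive end, and the rule that model spans take priority over overlapping word-list spans are all hypothetical choices for illustration.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end, name), end exclusive


def forward_max_match(text: str, lexicon: Set[str]) -> List[Span]:
    """Scan left to right, taking the longest lexicon entry at each position."""
    if not lexicon:
        return []
    max_len = max(len(name) for name in lexicon)
    spans: List[Span] = []
    i = 0
    while i < len(text):
        match_end = None
        # Try the longest candidate first so long names win over their prefixes.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                match_end = j
                break
        if match_end is None:
            i += 1
        else:
            spans.append((i, match_end, text[i:match_end]))
            i = match_end
    return spans


def merge_spans(model_spans: List[Span], lexicon_spans: List[Span]) -> List[Span]:
    """Keep all model spans; add lexicon spans that overlap none of the kept spans."""
    merged = list(model_spans)
    for s in lexicon_spans:
        if all(s[1] <= m[0] or s[0] >= m[1] for m in merged):
            merged.append(s)
    return sorted(merged)


lexicon = {"甲公司", "乙科技", "乙"}  # hypothetical word list
print(forward_max_match("甲公司收购乙科技", lexicon))
```

The merged spans would then go to manual verification in step S313, where wrongly labeled fields are corrected.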

Further, step S4 mainly includes:

step S41: vectorizing the input information text to obtain its vectorized representation; encoding the positions of the company names in the information text using the BIO encoding rule to obtain the company name labels; masking the position of each company name in the labels within the information text using the number 1, to generate a mask vector of each company name relative to the information text; generating the event category label corresponding to each event subject, where k indicates that k company name fields exist in the information text, each with its own corresponding mask vector and event category label;

step S42: taking as an example the text "Company A today's newsletter: a certain person steps down as president of a certain department and leaves company B; a certain shareholder plans to reduce holdings by no more than 6% of the shares.", text vectorization yields a one-dimensional vector of length 46, [101, 4567, …, 102]; the mask vector corresponding to the company name com₁ = "company A" is m₁ = [0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]; the corresponding event type label is lab₁ = [1, 0, 0, …, 0], where the length of lab₁ is the number of preset event types.
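The label digitization of step S4 can be sketched as follows, reproducing the mask vector of the worked example. Several details are assumptions rather than statements of the application: the tag names `B-COM`/`I-COM`/`O`, the token-level span format with an exclusive end, the two placeholder event type names, and the position of None at index 0 of the label vector (chosen only so that a None-type company yields a vector of the form [1, 0, 0, …, 0] as in the example).

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # (start, end) token positions, end exclusive

# Hypothetical category order; the application does not fix the indices.
LABELS = ["None", "EventType_1", "EventType_2", "Out"]


def bio_tags(seq_len: int, spans: Sequence[Span]) -> List[str]:
    """BIO-encode company name positions over a token sequence."""
    tags = ["O"] * seq_len
    for start, end in spans:
        tags[start] = "B-COM"
        for k in range(start + 1, end):
            tags[k] = "I-COM"
    return tags


def mask_vector(seq_len: int, span: Span) -> List[int]:
    """Mark the tokens of one company name with 1 and all others with 0."""
    start, end = span
    return [1 if start <= k < end else 0 for k in range(seq_len)]


def label_vector(event_types: Sequence[str]) -> List[int]:
    """Multi-hot event category label for one company name."""
    unknown = set(event_types) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown event types: {sorted(unknown)}")
    return [1 if label in event_types else 0 for label in LABELS]


# Company name occupying tokens 1..4 of a 46-token sequence ([CLS] at index 0).
m1 = mask_vector(46, (1, 5))
lab1 = label_vector(["None"])
print(m1[:6], lab1)
```

A text with k company names produces one BIO tag sequence for the first stage and k pairs of mask vector and label vector for the second stage.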

Further, referring to FIG. 3, constructing and training the two-stage event extraction model of company names and event types in step S5 includes the following steps:

step S51: inputting the vectorized representation of the information text into a pre-trained model BERT₁ to obtain a semantic representation of the information text;

step S52: sequentially inputting the semantic representation of the text into a single-layer linear function Linear and a normalized exponential function Softmax to obtain the predicted probability of whether each character in the information text belongs to a company name field;

step S53: calculating the loss value of fitting the company name fields using a cross entropy function, and performing back propagation and parameter optimization to obtain a trained company name prediction model;

step S54: inputting the vectorized representation of the information text into a pre-trained model BERT₂ to obtain a semantic representation of the information text;

step S55: traversing the mask vectors of each information text, using the mask vector to select the representation vectors corresponding to company name j in text i, and sequentially inputting them into a pooling function Avgpool, a single-layer linear function Linear and a logistic regression function Sigmoid to obtain the probability distribution of different events occurring for company name j in text i;

step S56: calculating the loss value of predicting the event types using a binary cross entropy loss function Loss, and performing back propagation and model parameter optimization to obtain a trained event type prediction model;
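The forward computation of steps S55 and S56 can be illustrated in plain Python. This is a sketch only: the `hidden` matrix stands in for the BERT₂ output (no pre-trained model is loaded), the weights are made-up numbers, all function names are hypothetical, and back propagation is not shown. Averaging the binary cross entropy over categories is an assumption about the reduction.

```python
import math
from typing import List, Sequence


def masked_avg_pool(hidden: Sequence[Sequence[float]],
                    mask: Sequence[int]) -> List[float]:
    """Average the token vectors selected by the company name's mask vector."""
    count = sum(mask)
    if count == 0:
        raise ValueError("mask selects no tokens")
    dim = len(hidden[0])
    return [sum(h[d] for h, m in zip(hidden, mask) if m) / count
            for d in range(dim)]


def linear(x: Sequence[float], weight: Sequence[Sequence[float]],
           bias: Sequence[float]) -> List[float]:
    """Single-layer linear function: one output per event category."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def bce_loss(probs: Sequence[float], targets: Sequence[int],
             eps: float = 1e-12) -> float:
    """Binary cross entropy averaged over categories."""
    total = sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for p, t in zip(probs, targets))
    return -total / len(probs)


hidden = [[0.0, 0.0], [1.0, 3.0], [3.0, 5.0], [9.0, 9.0]]  # stand-in for BERT2 output
mask = [0, 1, 1, 0]                                         # company name at tokens 1..2
pooled = masked_avg_pool(hidden, mask)                      # [2.0, 4.0]
logits = linear(pooled, [[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0])
probs = [sigmoid(z) for z in logits]
print(pooled, probs, bce_loss(probs, [1, 0]))
```

Pooling only over the masked positions is what lets one shared encoder pass score each company name in the text separately.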

Further, referring to FIG. 3, extracting the company names and the corresponding event types using the trained model in step S5 includes the following step:

step S57: according to steps S51 and S52, obtaining the probability of whether each character in the input information text belongs to a company name field, and extracting the company name fields in the input information text according to these probabilities; according to step S41, masking the position of each company name in the information text using the number 1, to generate a mask vector of each company name relative to the information text; according to steps S54 and S55, obtaining the probability distribution of different events occurring for each company name field in the information text, and extracting the event types occurring for each company name field in the input information text according to the probability distribution; if the event types extracted for a company name field are empty, that is, the probability of every category in the probability distribution is less than 0.5, selecting the event type with the largest probability value as the predicted event type;
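The decoding logic of step S57 can be sketched as follows: recovering company name spans from predicted tags, then selecting event types with the 0.5 threshold and the largest-probability fallback. The tag names `B-COM`/`I-COM`/`O`, the function names and the category names are assumptions for illustration; treating a probability of exactly 0.5 as selected follows from the text's "less than 0.5" condition for an empty result.

```python
from typing import List, Sequence, Tuple


def decode_spans(tags: Sequence[str]) -> List[Tuple[int, int]]:
    """Turn a BIO tag sequence into (start, end) spans, end exclusive."""
    spans: List[Tuple[int, int]] = []
    start = None
    for k, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        if tag == "I-COM" and start is not None:
            continue  # the current span keeps growing
        if start is not None:
            spans.append((start, k))
            start = None
        if tag == "B-COM":
            start = k
    return spans


def predict_event_types(labels: Sequence[str], probs: Sequence[float],
                        threshold: float = 0.5) -> List[str]:
    """Keep categories at or above the threshold; fall back to the argmax if none."""
    chosen = [label for label, p in zip(labels, probs) if p >= threshold]
    if not chosen:
        best = max(range(len(probs)), key=lambda idx: probs[idx])
        chosen = [labels[best]]
    return chosen


labels = ["EventType_1", "EventType_2", "None", "Out"]  # hypothetical categories
print(decode_spans(["O", "B-COM", "I-COM", "O", "B-COM"]))
print(predict_event_types(labels, [0.1, 0.3, 0.4, 0.2]))
```

The fallback guarantees that every extracted company name receives at least one category, so the later screening step always has something to inspect.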

Further, screening and outputting the company names and event types extracted by the model in step S6 includes: judging whether the event types corresponding to a company name contain Out or None, and if so, deleting those event types; if the event types corresponding to the company name are then not empty, outputting the company name and its corresponding event types; if they are empty, deleting the company name and its corresponding event types.
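The screening of step S6 can be sketched as a small filter. This reflects one reading of the step: Out and None are removed from each company's event types, and a company is output only if some event type remains. The function name, the dictionary format and the example names are hypothetical.

```python
from typing import Dict, List

NOISE_TYPES = ("None", "Out")


def screen_results(results: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Drop None/Out event types, then drop companies left with no event type."""
    output: Dict[str, List[str]] = {}
    for company, event_types in results.items():
        kept = [t for t in event_types if t not in NOISE_TYPES]
        if kept:
            output[company] = kept
    return output


extracted = {
    "company X": ["None"],                     # merely mentioned: removed
    "company Y": ["executive change", "Out"],  # Out is deleted, the rest is kept
    "company Z": ["Out"],                      # only an out-of-scope event: removed
}
print(screen_results(extracted))
```

This is where the noise company names, which the model was trained to recognize and classify, are finally discarded from the output.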

The second aspect of the application:

Referring to FIG. 4, an event extraction system for company information text includes: a text database module, a text preprocessing module, a text labeling module, a text modeling module and an output module;

the text database module is used for acquiring and storing information texts about companies;

the text preprocessing module is used for cleaning and preprocessing the information texts in the corpus;

the text labeling module is used for labeling the cleaned information texts according to a preset rule;

the text modeling module is used for text vectorization and label digitization, and for constructing and training a joint extraction model of company names and event types;

the output module is used for outputting the company names and event types extracted by the model.
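How the modules might be wired together at inference time can be sketched as follows. This is an illustrative composition only: the class name, field names and callable signatures are hypothetical, the text labeling module is omitted because it is used to prepare training data rather than to process new texts, and the stub functions stand in for real components.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Result = Dict[str, List[str]]  # company name -> event types


@dataclass
class EventExtractionSystem:
    load_texts: Callable[[], Iterable[str]]  # text database module
    preprocess: Callable[[str], str]         # text preprocessing module
    extract: Callable[[str], Result]         # text modeling module (trained model)
    screen: Callable[[Result], Result]       # output module

    def run(self) -> List[Result]:
        """Process every stored text and return the screened results."""
        return [self.screen(self.extract(self.preprocess(text)))
                for text in self.load_texts()]


# Stub components, for illustration only.
system = EventExtractionSystem(
    load_texts=lambda: ["  ABC  "],
    preprocess=lambda t: t.strip().lower(),
    extract=lambda t: {t: ["Out"], "x": ["EventType_1"]},
    screen=lambda r: {c: ts for c, ts in r.items() if "Out" not in ts},
)
print(system.run())
```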

The specific manner in which each module of the system in the above embodiment performs its operations has been described in detail in the method embodiments and is not repeated here.

For system embodiments, reference is made to the description of method embodiments for the relevant points, since they essentially correspond to the method embodiments. The system embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present application. Those of ordinary skill in the art will understand and implement the present application without undue burden.

Correspondingly, the application also provides an electronic device, comprising: one or more processors; and a memory for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors implement the event extraction method for company information text described above. FIG. 5 shows a hardware structure diagram of a device with data processing capability on which the system of the embodiment of the present application is located. In addition to the processor, the memory and the network interface shown in FIG. 5, the device may further include other hardware according to its actual function, which is not described here.

Accordingly, the present application also provides a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement a company information text oriented event extraction method as described above.

The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any of the data processing enabled devices described in any of the previous embodiments.

The computer readable storage medium may also be an external storage device, such as a plug-in hard disk, a Smart Media Card (SMC), an SD card, a flash card, etc.

Further, the computer readable storage medium may include both internal storage units and external storage devices of any device having data processing capabilities.

The computer readable storage medium is used for storing the computer program and other programs and data required by the arbitrary data processing apparatus, and may also be used for temporarily storing data that has been output or is to be output.

It will be understood that the above is only one embodiment of the present application, and that the present application is not limited to the structure that has been described above and shown in the drawings, but that several modifications and adaptations can be made without departing from the principle of the present application. The scope of the application is limited only by the appended claims.

Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains.

It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof.

Claims

1. An event extraction method for company information text is characterized by comprising the following steps:

(4) finally, screening and outputting the extracted company names and the corresponding event types.

2. The event extraction method for company information text according to claim 1, wherein the cleaning and preprocessing of the information text in the corpus in step (1) is specifically: sequentially unifying the case of English letters, unifying Chinese and English punctuation marks, converting traditional Chinese into simplified Chinese, and deleting garbled and non-printable characters.

3. The event extraction method for company information text according to claim 1, wherein labeling the cleaned information text according to a preset rule in step (2) comprises the following sub-steps:

(2.1) labeling the fields of all company names and their abbreviations in the information text, [com₁, com₂, com₃, …];

(2.2) according to the preset event types [EventType₁, EventType₂, EventType₃, …, EventType_n, None, Out], labeling all event types that occur for each company name field in the given information text, where [EventType₁, EventType₂, EventType₃, …, EventType_n] represents the event types to be extracted, n is the total number of event types to be extracted, None indicates that nothing happens to the company name field in the given information text, and Out indicates that an event of a type other than the event types to be extracted happens to the company name field in the given information text.

4. The event extraction method for company information text according to claim 3, wherein labeling the fields of all company names and their abbreviations in the information text in step (2.1) is specifically:

(2.1.1) acquiring an open-source dataset with strong company name annotations, namely the CLUENER fine-grained named entity recognition dataset, and separately screening out the samples in the dataset that contain company name labels;

5. The event extraction method for company information text according to claim 1, wherein the text vectorization and label digitization of the labeled information text in step (2) is specifically: vectorizing the input information text to obtain its vectorized representation; encoding the positions of the company names in the information text using the BIO encoding rule to obtain the company name labels; masking the position of each company name in the labels within the information text using the number 1, to generate a mask vector of each company name relative to the information text; and generating the event category label corresponding to each event subject, where k indicates that k company name fields exist in the information text, each with its own corresponding mask vector and event category label.

6. The event extraction method for company information text according to claim 1, wherein constructing and training the two-stage event extraction model of company names and event types in step (3) comprises the following sub-steps:

(6.1) inputting the vectorized representation of the information text into a pre-trained model BERT₁ to obtain a semantic representation of the information text;

(6.2) sequentially inputting the semantic representation of the text into a single-layer linear function Linear and a normalized exponential function Softmax to obtain the predicted probability of whether each character in the information text belongs to a company name field; calculating the loss value of fitting the company name fields using a cross entropy function, and performing back propagation and parameter optimization to obtain a trained company name prediction model;

(6.3) inputting the vectorized representation of the information text into a pre-trained model BERT₂ to obtain a semantic representation of the information text;

(6.4) traversing the mask vectors of each information text, using the mask vector to select the representation vectors corresponding to company name j in text i, and sequentially inputting them into a pooling function Avgpool, a single-layer linear function Linear and a logistic regression function Sigmoid to obtain the probability distribution of different events occurring for company name j in text i; calculating the loss value of predicting the event types using a binary cross entropy loss function Loss, and performing back propagation and model parameter optimization to obtain a trained event type prediction model.

7. The event extraction method for company information text according to claim 1, wherein extracting the company names and the corresponding event types using the trained model in step (3) is specifically: using the constructed and trained two-stage event extraction model of company names and event types, obtaining the probability of whether each character in the input information text belongs to a company name field, and extracting the company name fields in the input information text according to these probabilities; masking the position of each company name in the information text using the number 1, to generate a mask vector of each company name relative to the information text; using the trained two-stage event extraction model, obtaining the probability distribution of different events occurring for each company name field in the information text, and extracting the event types occurring for each company name field in the input information text according to the probability distribution; if the event types extracted for a company name field are empty, that is, the probability of every category in the probability distribution is less than 0.5, selecting the event type with the largest probability value as the predicted event type.

8. The event extraction method for company information text according to claim 1, wherein screening and outputting the company names and event types extracted by the model in step (4) is specifically: judging whether the event types corresponding to a company name contain Out or None, and if so, deleting those event types; if the event types corresponding to the company name are then not empty, outputting the company name and its corresponding event types; if they are empty, deleting the company name and its corresponding event types.

9. An event extraction system for company information text, comprising: a text database module, a text preprocessing module, a text labeling module, a text modeling module and an output module;

the output module: used for outputting the company names and event types extracted by the model.

10. A terminal device comprising a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the company information text oriented event extraction method according to any one of claims 1 to 8 when the computer program is executed.