CN117436521A

CN117436521A - Method for automatically constructing large-scale event relation corpus in biomedical field and corpus

Info

Publication number: CN117436521A
Application number: CN202311563992.4A
Authority: CN
Inventors: 李丽双; 费禹潇; 张贝贝; 冯大鹏
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2023-11-22
Filing date: 2023-11-22
Publication date: 2024-01-23

Abstract

A method for automatically constructing a large-scale event relation corpus in the biomedical field, and the resulting corpus, belong to the field of natural language processing and address the problem of automatically labeling large-scale event relation corpora in the biomedical field. The key points are: S10, obtaining biological entities and event relations from biomedical text; S20, defining element paths and constructing trigger word semantic matching templates; S30, calculating a key element path set according to the key element path proportion; S40, according to the trigger word matching rate, selecting from the semantic types in the trigger word semantic matching templates the semantic type with the highest key element path coverage proportion and taking it as the event relation matched by the current trigger word pair. The method is used for automatically labeling large-scale event relation corpora in the biomedical field.

Description

Method for automatically constructing large-scale event relation corpus in biomedical field and corpus

Technical field:

The invention belongs to the field of natural language processing, and particularly relates to a method for constructing a large-scale event relation corpus in the biomedical field and to a corpus obtained by applying the method.

Background art:

Over the past decade, research on event relation extraction has relied mainly on a series of manually annotated corpora. Constrained by annotation cost and domain characteristics, existing corpora are small in scale and narrow in domain, which limits both model training and the range of applications. Automatically building large-scale event relation corpora for specialized domains has therefore attracted sustained attention from researchers. Although current research has achieved some results in the general domain, corpus construction for event relation extraction in the biomedical field is still at an early stage. How to build on general-domain work to advance the construction of event relation corpora in the biomedical field is thus an important problem for current researchers.

At present, methods for automatically constructing large-scale event relation corpora are concentrated mainly in the general domain and are generally divided into unsupervised methods and weakly supervised methods.

Unsupervised methods generally label event relations directly in raw text using pre-designed event relation templates. For example, Radinsky et al. (Learning causality for news events prediction. Proceedings of the 21st International Conference on World Wide Web, 2012) designed causal relation templates between events and automatically annotated causal event pairs directly from large-scale raw news headlines. This approach is simple and easy to operate, since it only requires matching fixed templates. It is effective for labeling explicit event relations but performs poorly on implicit relations. To address this problem, researchers have proposed weakly supervised event relation labeling.

Weakly supervised event relation annotation mainly comprises two kinds of methods: Bootstrapping-based methods and distant-supervision-based methods. Bootstrapping is currently a common way to construct event or event relation corpora. For example, Williams et al. (Extracting and modeling durations for habits and events from Twitter. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, 2012) trained a decision-tree-based Bootstrapping classifier that automatically estimates the duration of events and habits described in tweets, and labeled a very large corpus containing 1,400 thousand tweets. Yao et al. (A weakly supervised approach to train temporal relation classifiers and acquire regular event pairs simultaneously. Proceedings of the International Conference Recent Advances in Natural Language Processing, 2017) first designed temporal templates to identify seed event pairs, then iteratively propagated these seeds into a large unlabeled corpus, training a temporal event relation classifier on hundreds of thousands of sentences containing event pairs and continuously expanding the scale of the corpus. However, Bootstrapping uses only a small amount of manually annotated corpus to train the model in the initial iteration; as the share of manually annotated data in the training set shrinks during iteration, the supervision provided by experts gradually weakens and semantic drift easily occurs.

For this reason, many scholars have proposed constructing large-scale event relation corpora automatically with knowledge-base-based distant supervision. For example, Hashimoto et al. (Generating event causality hypotheses through semantic relations. Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015) introduce a semantic knowledge base, create an extended set of causal hypothesis candidates by replacing the noun elements in causal event pairs, and then train a classifier on event samples from the candidate set to rank the results by credibility, thereby labeling event causality. Hassanzadeh (Building a Knowledge Graph of Events and Consequences Using Wikipedia and Wikidata. In Proceedings of the Wiki Workshop at The Web Conference, 2022) obtains a variety of event types and event descriptions from Wikipedia and constructs a series of prompt templates for event relation question answering. The templates induce a pre-trained language model to generate the corresponding cause event description on unlabeled data; the corresponding effect event is then obtained by linking to Wikipedia, and the resulting complete cause-effect event pairs are used to construct an event causality knowledge graph. Although conventional knowledge-base-based distant supervision is effective in the general domain, which consists mainly of simple events, the biomedical field contains many complex events whose relation matching patterns are harder to obtain, so general-domain methods cannot guarantee labeling precision for event relations in the biomedical field.

In summary, no publicly available large-scale corpus has yet been found for biological event relation extraction, and because biological event relations are complex, general-domain corpus construction methods cannot cope well with the task of constructing a biological event relation corpus, which limits the progress of related research. How to design a high-precision, high-efficiency automatic labeling method that handles the complexity of biological event relations, so as to construct a large-scale biological event relation corpus, is therefore a problem to be solved in further research on biological event relation extraction.

Summary of the invention:

To solve the problem of automatically labeling large-scale event relation corpora in the biomedical field, a method for automatically constructing a large-scale event relation corpus in the biomedical field according to some embodiments of the present application comprises:

S10, obtaining biological entities and event relations from biomedical texts;

S20, defining element paths and constructing trigger word semantic matching templates;

S30, calculating a key element path set according to the key element path proportion;

S40, according to the trigger word matching rate, selecting from the semantic types in the trigger word semantic matching templates the semantic type with the highest key element path coverage proportion, and taking that semantic type as the event relation matched by the current trigger word pair.

According to some embodiments of the present application, in step S30 of the method for automatically constructing a large-scale event relation corpus in the biomedical field, calculating the key element path set according to the key element path proportion specifically includes:

calculating the element path importance $APS_{i,j}$, represented by the following formula:

calculating the event relation correlation $ERR_i$, represented by the following formula:

calculating the key element path proportion $KARP_{i,j}$, represented by the following formula:

$$KARP_{i,j} = APS_{i,j} \times ERR_i$$

calculating the key element path proportion $KARP_{i,j}$ for each event semantic relation type, sorting, and taking the top-K element paths as the current key element path set;

wherein: $count(PA_i, ETP_j)$ is the number of samples of the j-th semantic relation type $ETP_j$ in the knowledge base that contain the i-th element path $PA_i$; $count(ETP_j)$ is the total number of samples of the j-th semantic relation type $ETP_j$ in the knowledge base; $sum(ETP)$ is the number of semantic relation types in the semantic relation type set ETP of the knowledge base; $count(ETPC_i)$ is the number of semantic relation types in the knowledge base that contain the i-th element path $PA_i$; and $\varepsilon$ is a constant that prevents the denominator from being 0.
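The formulas for $APS_{i,j}$ and $ERR_i$ referenced above are not reproduced in this text. One reading that is consistent with the symbol definitions, assuming a plain frequency ratio for each factor (whether the source applies a logarithm or other smoothing is not known, so this form is an assumption), is:

```latex
\begin{aligned}
APS_{i,j}  &= \frac{count(PA_i,\ ETP_j)}{count(ETP_j)} \\[4pt]
ERR_i      &= \frac{sum(ETP)}{count(ETPC_i) + \varepsilon} \\[4pt]
KARP_{i,j} &= APS_{i,j} \times ERR_i
\end{aligned}
```

Under this reading, $APS_{i,j}$ grows with how often path $PA_i$ occurs within relation type $ETP_j$, while $ERR_i$ shrinks for paths that occur under many relation types, so a high $KARP_{i,j}$ marks a path that is both frequent in and specific to a relation type.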

According to some embodiments of the present application, step S40 of the method for automatically constructing a large-scale event relation corpus in the biomedical field specifically includes:

calculating the trigger word pair candidate frequency $TPCF_{i,j}$, represented by the following formula:

calculating the trigger word semantic matching frequency $TPMF_i$, represented by the following formula:

calculating the trigger word matching rate $TMR_{i,j}$, represented by the following formula:

$$TMR_{i,j} = TPCF_{i,j} \times TPMF_i$$

for a trigger word pair, selecting the event semantic relation type with the largest $TMR_{i,j}$ as the event relation matched by the current trigger word pair;

wherein: $count(EP_i, TPS_j)$ is the number of samples of the i-th trigger word pair $EP_i$ in the text under the j-th event semantic type pair $TPS_j$; $count(TPS_j)$ is the number of all trigger word pairs contained in the j-th semantic type pair $TPS_j$ in the documents; $sum(ETP)$ is the total number of semantic type pairs in the semantic type pair set ETP; and $count(ETPS_i)$ is the number of semantic type pairs containing the trigger word pair $EP_i$.
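The formulas for $TPCF_{i,j}$ and $TPMF_i$ referenced above are likewise not reproduced in this text. A reading consistent with the symbol definitions, assuming plain frequency ratios and assuming a small constant $\varepsilon$ in the denominator by analogy with the key element path proportion (both are assumptions, not confirmed by the source), is:

```latex
\begin{aligned}
TPCF_{i,j} &= \frac{count(EP_i,\ TPS_j)}{count(TPS_j)} \\[4pt]
TPMF_i     &= \frac{sum(ETP)}{count(ETPS_i) + \varepsilon} \\[4pt]
TMR_{i,j}  &= TPCF_{i,j} \times TPMF_i
\end{aligned}
```

Under this reading $TPMF_i$ is the same for every $j$, so for a fixed trigger word pair the choice of semantic type is driven by $TPCF_{i,j}$, while $TPMF_i$ scales the score by how selective the trigger word pair is.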

According to some embodiments of the present application, in the method for automatically constructing a large-scale event relation corpus in the biomedical field, the formalized format of an element path is as follows:

<trigger word T₁, role-type edge of element 1, element relation, element 2, role-type edge of element 2, trigger word T₂>

wherein trigger word T₁ and element 1 belong to event E₁, trigger word T₂ and element 2 belong to event E₂, a role-type edge denotes the relation between an internal element of a biological event and its trigger word unit, and complex events are labeled using multiple element paths.

According to some embodiments of the present application, the method for automatically constructing a large-scale event relation corpus in the biomedical field further comprises S50: performing event relation expansion and noise filtering according to whether the trigger word pairs can be queried in the FrameNet semantic units mapped from the knowledge base.

According to some embodiments of the present application, in the method for automatically constructing a large-scale event relation corpus in the biomedical field, the event relation expansion and noise filtering include:

if a trigger word pair in the text can be queried in the FrameNet semantic units mapped from the knowledge base, the words and phrases in the semantic units of the corresponding frames are used to expand the set of trigger word pairs, thereby expanding the labeling scale of event relations and obtaining a large-scale automatically labeled biological event relation corpus;

if a trigger word pair in the text cannot be queried in the FrameNet semantic units mapped from the knowledge base, the trigger word pair is filtered out as noise.

According to some embodiments of the present application, the method for automatically constructing a large-scale event relation corpus in the biomedical field further comprises S60: expert checking, and back-labeling with an event relation extraction model trained in two stages.

According to some embodiments of the present application, in step S60 an expert checks part of the large-scale automatically labeled biological event relation corpus to verify whether the automatic labels are accurate, and the data verified as accurate by the expert form a high-quality data set;

the part of the large-scale automatically labeled biological event relation corpus that is not checked by the expert serves as the pre-training corpus of the event relation extraction model, and the event relation extraction model is trained in two stages and used for back-labeling.

According to some embodiments of the present application, training the event relation extraction model in two stages and back-labeling comprises the following steps:

in the first stage, the event relation extraction model is pre-trained on the pre-training corpus, so that the model learns the characteristics of complex biological events and adapts to the biomedical event relation extraction task;

in the second stage, the high-quality data set is divided into training, validation, and test sets, and the pre-trained event relation extraction model is fine-tuned on the high-quality data set to reduce the deviation between predictions and true results;

the event relation extraction model obtained from two-stage training is used to back-label the large-scale automatically labeled biological event relation corpus that was not checked by the expert, and the back-labeled data together with the high-quality data set constitute the large-scale, high-quality biomedical event relation corpus.

The invention also relates to a large-scale event relation corpus in the biomedical field, which is constructed by any one of the methods.

Beneficial effects of the invention:

The method provided by the invention can automatically label a large-scale event relation corpus in the biomedical field, further improves the labeled corpus through expert correction, two-stage training, and back-labeling, and finally obtains a large-scale, high-quality event relation corpus in the biomedical field, which helps to advance research on event relation extraction in the biomedical field.

By using the key element path proportion and labeling complex event relations through multiple element paths, the potential relations between elements of different complex events can be discovered at a finer granularity, which improves labeling quality.

Because biological events have relatively few elements, few key element paths are matched, so several key element paths in the element path set may tie for the highest proportion, which introduces ambiguity and uncertainty into the subsequent automatic labeling of event relations. To overcome this problem, the invention uses the trigger word types associated with the elements to construct trigger word matching templates, calculates the trigger word matching rate, and selects the best-matching combination of trigger word semantic types to determine the optimal event relation type.

Description of the drawings:

FIG. 1 is a framework diagram of automated annotation of large-scale biological event relational corpora based on a knowledge base.

Detailed description of the embodiments:

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the invention. The invention provides a method for automatically constructing a large-scale event relation corpus in the biomedical field, which mainly comprises the following four parts:

(I) Composing element paths and obtaining key element paths

(II) Trigger word semantic matching

(III) Event relation expansion and noise filtering

(IV) Expert review and back-labeling with an event relation extraction model trained in two stages

The specific flow is as follows:

(I) Composing element paths and obtaining key element paths

(1) Biomedical text is first acquired and preprocessed, and biological entity recognition and biological event extraction are performed to obtain basic biomedical units, which comprise entities, events, and the element relations between elements and trigger words within events.

(2) Composing element paths: an element path is formalized as <trigger word T₁, role-type edge of element 1, element relation, element 2, role-type edge of element 2, trigger word T₂>, where trigger word T₁ and element 1 belong to event E₁, and trigger word T₂ and element 2 belong to event E₂. A role-type edge denotes the relation between an internal element of a biological event and its trigger word unit, and can be obtained during event extraction. The element relation on an element path can be obtained from the complex semantic network in the knowledge base. To make effective use of the structured information of more complex fine-grained elements such as entities and events, and to improve labeling quality, the invention uses multiple element paths to label complex events. In addition, based on a given element path and the event types corresponding to its trigger words, a trigger word semantic matching template of the form <event type 1 → semantic type → event type 2> is constructed for subsequent labeling.

(3) Key element path acquisition: to express the possible event relations between two event types more effectively, the invention defines the K most important element paths as key element paths.

(4) The invention proposes a new metric, the key element path proportion (KARP), which is obtained from the element path importance (APS) and the event relation correlation (ERR). $KARP_{i,j}$ is calculated for each event relation type, and after sorting the top-K element paths are taken as the current key element path set.

(II) Trigger word semantic matching

(1) Using the acquired trigger word semantic matching templates, the semantic type in a template is converted into the event semantic relation type matched by the trigger words. Specifically, from the semantic types contained in the key element path set, the semantic type with the highest key element path coverage proportion is selected as the event semantic relation type matched by the current two trigger words (trigger word T₁ and trigger word T₂).

(2) The invention proposes a new trigger word semantic matching index, the trigger word matching rate (TMR), which is used to select the best-matching combination of trigger word semantic types. The set of trigger word pairs generated by matching is defined as EP and the corresponding set of semantic type pairs as TPS. Analogously to the key element path proportion, TMR is calculated from the trigger word pair candidate frequency (TPCF) and the trigger word semantic matching frequency (TPMF). The event semantic relation type with the largest $TMR_{i,j}$ is selected as the event relation matched by the current trigger word pair, and the trigger word pair is labeled with this event relation through retrieval in the knowledge base semantic network.

(III) event relationship expansion and noise filtering

In the labeling flow, some trigger words may be absent from the knowledge base, and some trigger word pairs may not match the semantic types calculated by TMR. The trigger words are therefore expanded and filtered with the help of the framework of FrameNet, a general-purpose semantic knowledge resource.

(1) When a trigger word pair cannot be queried in the FrameNet semantic units mapped from the knowledge base, the trigger word pair is filtered out as noise.

(2) When a trigger word pair can be queried in the FrameNet semantic units mapped from the knowledge base, the words or phrases in the semantic units of the corresponding frames can be used to expand the set of trigger word pairs, thereby expanding the labeling scale of event relations.

(IV) Expert review and back-labeling with an event relation extraction model trained in two stages

(1) A portion of the automatically labeled large-scale corpus is randomly selected and handed to experts for inspection. The data confirmed as correct are retained as a manually verified high-quality data set, which is used in the subsequent process to fine-tune the model and support large-scale back-labeling, thereby improving the quality of the large-scale automatically labeled data.

(2) For event relation extraction, the model is pre-trained with the large-scale corpus as pre-training corpus and fine-tuned on the manually verified high-quality data; the fine-tuned event relation extraction model is then used to back-label the large-scale corpus, finally yielding a large-scale, high-quality biomedical event relation corpus.

In a specific example, FIG. 1 is a schematic diagram of the knowledge-base-based method for automatically labeling a large-scale biological event relation corpus.

Step one: biomedical text acquisition and preprocessing

First, 21,073,378 abstracts of open-access biomedical literature are downloaded from the official PubMed website, 570,035 abstracts on the subject of cancer are selected, and biological entities and events are extracted from them. In addition, the abstracts are sorted by the number of events they contain, and abstracts containing few events are excluded so that the text contains enough events for model training.
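The sort-and-exclude step described above can be sketched as follows. This is a minimal illustration only: the `Abstract` record, the `rank_and_filter` helper, and the threshold `min_events=3` are hypothetical, since the source does not state a cutoff.

```python
from typing import NamedTuple


class Abstract(NamedTuple):
    pmid: str      # hypothetical identifier for an abstract
    n_events: int  # number of biological events extracted from it


def rank_and_filter(abstracts, min_events=3):
    """Sort abstracts by event count (descending) and drop sparse ones.

    min_events is an illustrative threshold, not a value from the source.
    """
    ranked = sorted(abstracts, key=lambda a: a.n_events, reverse=True)
    return [a for a in ranked if a.n_events >= min_events]


sample = [Abstract("a", 1), Abstract("b", 5), Abstract("c", 3)]
kept = rank_and_filter(sample)
```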

Step two: composing element paths and acquiring key element paths

(1) Composing event element paths

For the large number of preprocessed biomedical texts, the elements obtained by extracting biological entities and events are composed into element paths formalized as <trigger word T₁, role-type edge of element 1, element relation, element 2, role-type edge of element 2, trigger word T₂>.

Trigger word T₁ and element 1 belong to event E₁, and trigger word T₂ and element 2 belong to event E₂. A role-type edge denotes the relation between an internal element of a biological event and its trigger word unit, and can be obtained during event extraction. The element relation on an element path can be obtained from the complex semantic network in the knowledge base. To make effective use of the structured information of more complex fine-grained elements (entities and events) and to improve labeling quality, the invention uses multiple element paths to label complex events.
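As a rough illustration of the element path structure, the tuple can be held in a small record type. The field names, the explicit `element_1` field (the tuple as written names only element 1's role-type edge), and all example values below are assumptions made for this sketch and do not come from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementPath:
    trigger_1: str         # trigger word T1 of event E1
    role_edge_1: str       # role-type edge between T1 and element 1
    element_1: str         # assumption: stored explicitly for symmetry
    element_relation: str  # relation between the elements, from the knowledge base
    element_2: str         # element 2 of event E2
    role_edge_2: str       # role-type edge between T2 and element 2
    trigger_2: str         # trigger word T2 of event E2


# A complex event pair is described by several element paths
# (all values are invented placeholders).
paths = [
    ElementPath("phosphorylation", "Theme", "protein A", "interacts_with",
                "protein B", "Theme", "degradation"),
    ElementPath("phosphorylation", "Cause", "kinase C", "affects",
                "protein B", "Theme", "degradation"),
]

# Frozen dataclasses are hashable, so paths can be counted or used as dict keys.
distinct_paths = set(paths)
```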

(2) Defining a key element path:

The invention defines the K most important element paths as key element paths. To evaluate the importance of element relation paths, a new metric, the key element path proportion (KARP), is proposed, which is composed mainly of two factors: element path importance (APS) and event relation correlation (ERR). The specific algorithm is shown in the following table. From the obtained key element path set, the trigger word semantic matching templates based on that set can be constructed.

The beneficial effects of this further scheme are as follows: because biological events are complex, the element path structure carries the element information of complex events, and labeling complex event relations with multiple element paths makes effective use of the structured information of more complex fine-grained elements (entities and events), which improves labeling quality.
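A minimal sketch of the top-K selection, assuming the ratio forms APS = count(PA_i, ETP_j) / count(ETP_j) and ERR = sum(ETP) / (count(ETPC_i) + ε). These forms are an assumption, because the source's formulas and algorithm table are not reproduced here; the function name and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def key_element_paths(samples, k=2, eps=1e-9):
    """samples: iterable of (element_path_id, relation_type) pairs.

    Returns {relation_type: top-k element paths ranked by KARP}.
    APS and ERR use assumed ratio forms (see the lead-in).
    """
    samples = list(samples)
    pair_count = Counter(samples)                     # count(PA_i, ETP_j)
    type_count = Counter(rel for _, rel in samples)   # count(ETP_j)
    types_with_path = defaultdict(set)                # used for count(ETPC_i)
    for path, rel in samples:
        types_with_path[path].add(rel)
    n_types = len(type_count)                         # sum(ETP)

    result = {}
    for rel, n_rel in type_count.items():
        karp = {}
        for (path, r), c in pair_count.items():
            if r != rel:
                continue
            aps = c / n_rel
            err = n_types / (len(types_with_path[path]) + eps)
            karp[path] = aps * err
        # Highest KARP first; ties broken by path id for determinism.
        ranked = sorted(karp, key=lambda p: (-karp[p], p))
        result[rel] = ranked[:k]
    return result


toy = [("p1", "cause"), ("p1", "cause"), ("p2", "cause"),
       ("p2", "inhibit"), ("p3", "inhibit"), ("p3", "inhibit")]
top1 = key_element_paths(toy, k=1)
```

In the toy data, path `p2` occurs under both relation types and is therefore down-weighted, so `p1` and `p3` are selected as the key paths for their respective types.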

Step three: trigger word semantic matching

To select the combination that best matches the trigger word semantics, the invention proposes a trigger word semantic matching index, the trigger word matching rate (TMR). The set of trigger word pairs generated by matching is defined as EP and the corresponding set of semantic type pairs as TPS. TMR is composed of two parts, the trigger word pair candidate frequency (TPCF) and the trigger word semantic matching frequency (TPMF); the specific algorithms are shown in the following table.

The beneficial effects of this further scheme are as follows: because biological events have relatively few elements, an element relation path may have the highest proportion in several key element path sets at once, which means the event pair may match multiple trigger word semantic templates. With this statistical index, the method can select the best-matching combination of trigger word semantic types to determine the optimal event relation type.
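The selection by TMR can be sketched as below, assuming the ratio forms TPCF = count(EP_i, TPS_j) / count(TPS_j) and TPMF = sum(ETP) / (count(ETPS_i) + ε). The forms, the small constant `eps`, the function name, and the toy observations are all assumptions for illustration, since the source's algorithm table is not reproduced here.

```python
from collections import Counter, defaultdict


def best_semantic_type(observations, trigger_pair, eps=1e-9):
    """observations: iterable of (trigger_pair, semantic_type_pair).

    Returns the semantic type pair with the largest TMR for trigger_pair,
    or None if the pair was never observed.
    """
    observations = list(observations)
    joint = Counter(observations)                        # count(EP_i, TPS_j)
    per_type = Counter(tps for _, tps in observations)   # count(TPS_j)
    types_of_pair = defaultdict(set)                     # used for count(ETPS_i)
    for ep, tps in observations:
        types_of_pair[ep].add(tps)
    n_types = len(per_type)                              # sum(ETP)

    candidates = types_of_pair.get(trigger_pair)
    if not candidates:
        return None
    tpmf = n_types / (len(candidates) + eps)
    tmr = {tps: joint[(trigger_pair, tps)] / per_type[tps] * tpmf
           for tps in candidates}
    # sorted() fixes the iteration order so ties resolve deterministically.
    return max(sorted(tmr), key=tmr.get)


ep1, ep2 = ("inhibit", "growth"), ("induce", "apoptosis")
obs = [(ep1, "A")] * 3 + [(ep1, "B")] + [(ep2, "B")] * 3 + [(ep2, "C")]
choice = best_semantic_type(obs, ep1)
```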

Step four: event relationship expansion and noise filtering

During the labeling of biological event relations, some trigger word pairs do not exist in the knowledge base and need to be expanded separately. Moreover, some trigger word pairs in the knowledge base do not match the semantic type pairs calculated by TMR, which can introduce noise into corpus labeling.

To address these problems, the invention filters and expands the trigger words by means of the framework of FrameNet, a general-purpose semantic knowledge resource, and by computing the mapping from semantic type pairs of the knowledge base to semantic unit pairs of FrameNet.

(1) If a trigger word pair in the text cannot be queried in the FrameNet semantic units mapped from the knowledge base, the trigger word pair is filtered out as noise.

(2) When a trigger word pair can be queried in the FrameNet semantic units mapped from the knowledge base, the words or phrases in the semantic units of the corresponding frames can be used to expand the set of trigger word pairs, thereby expanding the labeling scale of event relations.
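The two rules above can be sketched with a plain dictionary standing in for the knowledge-base-to-FrameNet mapping. The `lexical_units` structure, the requirement that both trigger words be found, and the example words are assumptions for this sketch; no real FrameNet API is used.

```python
def expand_and_filter(trigger_pairs, lexical_units):
    """trigger_pairs: {(t1, t2): relation_label}
    lexical_units: hypothetical {trigger word: set of words in the same frame},
                   standing in for the mapped FrameNet semantic units.
    """
    labelled = {}
    for (t1, t2), relation in trigger_pairs.items():
        # Rule (1): a pair that cannot be found is filtered out as noise.
        if t1 not in lexical_units or t2 not in lexical_units:
            continue
        # Rule (2): expand the pair with words from the corresponding frames.
        for a in sorted(lexical_units[t1] | {t1}):
            for b in sorted(lexical_units[t2] | {t2}):
                labelled.setdefault((a, b), relation)  # keep the first label seen
    return labelled


pairs = {("inhibit", "growth"): "Prevents", ("foo", "bar"): "Causes"}
units = {"inhibit": {"suppress"}, "growth": {"proliferation"}}
expanded = expand_and_filter(pairs, units)
```

Here the pair `("foo", "bar")` is dropped because neither word is in the mapping, while `("inhibit", "growth")` is expanded to four labelled pairs.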

Step five: expert inspection and back-labeling with an event relation extraction model trained in two stages

Through the above steps, a large-scale automatically labeled biological event relation corpus is obtained, totaling 189,447 items. Drawing on the idea of Bootstrapping, the invention uses the remaining 169,447 items, those not handed to experts for checking, as pre-training corpus and adopts a two-stage training and back-labeling method to further improve the accuracy of the automatically labeled corpus. The specific flow is as follows:

(1) In the first stage, the T5 model is used for event relation extraction and is pre-trained on the pre-training corpus, so that the model learns the characteristics of complex biological events and adapts better to the biomedical event relation extraction task. Compared with the commonly used BERT model, the T5 model also has a decoder and a larger number of parameters, and can extract the characteristics of complex events in text more effectively.

(2) In the second stage, the pre-trained model is fine-tuned on the high-quality data to reduce the deviation between predictions and true results. This data set consists of 8,000 correct items retained after expert verification and is divided into training, validation, and test sets for fine-tuning.

Finally, the high-precision event relation extraction model obtained from two-stage training is used to back-label the large-scale corpus, yielding a large-scale, high-quality biomedical event relation corpus of 177,447 items.
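The data flow of the two stages can be sketched independently of any particular model. In this sketch the `pretrain` and `finetune` callables are supplied by the caller and are hypothetical stand-ins for training the event relation extraction model; the 8:1:1 split and the toy stubs are also assumptions, since the source does not give split ratios.

```python
import random


def two_stage_relabel(auto_corpus, verified, pretrain, finetune, seed=0):
    """auto_corpus: [(text, auto_label)] not checked by experts.
    verified: [(text, label)] confirmed by experts.
    pretrain / finetune: caller-supplied training functions (hypothetical).
    Returns (final corpus, held-out test split).
    """
    data = list(verified)
    random.Random(seed).shuffle(data)
    a, b = len(data) * 8 // 10, len(data) * 9 // 10   # assumed 8:1:1 split
    train, dev, test = data[:a], data[a:b], data[b:]

    model = pretrain(auto_corpus)           # stage 1: pre-train on auto labels
    model = finetune(model, train, dev)     # stage 2: fine-tune on verified data

    relabelled = [(text, model(text)) for text, _ in auto_corpus]  # back-labeling
    return relabelled + list(verified), test


# Toy stubs so the sketch runs end to end.
def toy_pretrain(corpus):
    return lambda text: "none"


def toy_finetune(model, train, dev):
    return lambda text: "cause" if "induces" in text else model(text)


auto = [("A induces B", "none"), ("C and D", "cause")]
gold = [(f"E{i} induces F{i}", "cause") for i in range(10)]
corpus, held_out = two_stage_relabel(auto, gold, toy_pretrain, toy_finetune)
```

The final corpus is the back-labeled unverified data plus the expert-verified data, which mirrors how 169,447 back-labeled items and 8,000 verified items give the 177,447 items reported above.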

The results of the training stage of the event relation extraction model show that fine-tuning only on the divided training set gives a macro_avg_F1 of 89.49%, whereas pre-training followed by fine-tuning gives 95.89%, which is 6.40 percentage points higher than direct fine-tuning. Pre-training on the large-scale corpus before fine-tuning therefore greatly improves the biological event relation extraction model.

Having experts evaluate part of the corpus verifies the quality of the final biological event relation corpus. 5,000 items are randomly selected from the back-labeled corpus and checked by experts; the accuracy of the back-labeled corpus is 95.32%, an improvement of 31.02 percentage points over the 64.3% accuracy of the corpus that was only automatically labeled, so the accuracy of the automatically labeled corpus is greatly improved. In addition, based on the distant supervision assumption in entity relation extraction, the invention maps knowledge units from the knowledge base onto biomedical text entities and determines whether a relation exists between two entities linked to knowledge units by checking whether those knowledge units are connected by an edge in the knowledge base semantic network. Finally, 2,024,044 biomedical entity relations are obtained from the resulting large-scale event relation corpus by distant supervision, and multi-path semantic connections between event trigger words are realized through these entity relations. To ensure labeling quality, 5,000 entity relations are randomly selected and verified by experts, and the accuracy of the entity relations is estimated at 96.2%.

Finally, through the above steps, the invention obtains a high-quality, large-scale event relation corpus in the biomedical field, and the corpus supports multiple tasks at once, such as biological event relation extraction, entity relation extraction, and knowledge reasoning. While the invention has been described with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
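The reported figures can be cross-checked with a few lines of arithmetic. The size of the expert-checked sample (20,000) is not stated in the text; it is inferred here as the difference between the total and the pre-training pool, and should be read as an assumption.

```python
from decimal import Decimal

total_auto = 189_447      # automatically labeled items
pretrain_pool = 169_447   # items used as pre-training corpus
verified = 8_000          # correct items retained after expert verification
final_corpus = 177_447    # reported size of the final corpus

expert_sample = total_auto - pretrain_pool   # inferred, not stated in the source
corpus_check = pretrain_pool + verified      # back-labeled pool + verified items

# Decimal avoids floating-point noise in the percentage differences.
f1_gain = Decimal("95.89") - Decimal("89.49")   # pre-train + fine-tune vs. fine-tune only
acc_gain = Decimal("95.32") - Decimal("64.3")   # back-labeled vs. automatic labeling only
```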

Claims

1. A method for automatically constructing a large-scale event relation corpus in the biomedical field is characterized by comprising the following steps of

S10, obtaining biological entities and event relations from biomedical texts;

2. The method for automatically constructing a large-scale event relation corpus in biomedical fields according to claim 1, wherein in step S30, a key element path set is calculated according to a key element path proportion, specifically comprising

$$KARP_{i,j} = APS_{i,j} \times ERR_i$$

3. The method for automatically constructing a large-scale event relational corpus in biomedical fields according to claim 2, wherein step S40 specifically comprises

$$TMR_{i,j} = TPCF_{i,j} \times TPMF_i$$

wherein: $count(EP_i, TPS_j)$ is the number of samples of the i-th trigger word pair $EP_i$ in the text under the j-th event semantic type pair $TPS_j$; $count(TPS_j)$ is the number of all trigger word pairs contained in the j-th semantic type pair $TPS_j$ in the documents; $sum(ETP)$ is the total number of semantic type pairs in the semantic type pair set ETP; and $count(ETPS_i)$ is the number of semantic type pairs containing the trigger word pair $EP_i$.

4. The method for automatically building a large-scale event relational corpus in biomedical fields according to claim 3, wherein the formalized format of element paths is as follows:

5. The method for automatically constructing a large-scale event relation corpus in the biomedical field according to claim 4, further comprising S50: performing event relation expansion and noise filtering according to whether the trigger word pairs can be queried in the FrameNet semantic units mapped from the knowledge base.

6. The method for automatically constructing a large-scale event relationship corpus in a biomedical field of claim 5, wherein the event relationship expansion and noise filtering comprises

7. The method for automatically constructing a large-scale event relation corpus in the biomedical field according to claim 6, further comprising S60: expert checking, and back-labeling with an event relation extraction model trained in two stages.

8. The method for automatically constructing large-scale event relation corpora in the biomedical field according to claim 7, wherein in step S60, the expert checks a part of the large-scale automatically labeled biological event relation corpora to verify whether the automatically labeled result is accurate, and the expert verifies the accurate data to form a high-quality data set;

9. The method for automatically constructing a large-scale event relationship corpus in a biomedical field according to claim 8, wherein the training and the back labeling of the event relationship extraction model by adopting two stages comprises

10. A large-scale event relational corpus in the biomedical field, characterized in that it is constructed by the method according to any one of claims 1-9.