CN117436522A - Biological event relation extraction method and large-scale biological event relation knowledge base construction method of cancer subject - Google Patents


Info

Publication number
CN117436522A
Authority
CN
China
Prior art keywords
data
event
prompt
sample
biological event
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311563994.3A
Other languages
Chinese (zh)
Inventor
李丽双
宁婉廷
向毅
冯大鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Application filed by Dalian University of Technology
Priority to CN202311563994.3A
Publication of CN117436522A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G06N5/025 Extracting rules from data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Machine Translation (AREA)

Abstract

A biological event relation extraction method and a method for constructing a large-scale, cancer-themed biological event relation knowledge base, belonging to the field of natural language processing. The method comprises: S10, for a class-imbalanced dataset, training a data generation model based on prompt learning to generate subclass (minority-class) sample data; S20, screening the subclass sample data generated by the data generation model with a classification model based on a prototype network, and adding the retained subclass sample data to the class-imbalanced dataset to produce an enhanced dataset; S30, training an event relation extraction model on the enhanced dataset and classifying biological event relations with the trained model. This mitigates the degradation of biological event relation extraction performance caused by class imbalance, a problem that existing research has not adequately solved.

Description

Biological event relation extraction method and large-scale biological event relation knowledge base construction method of cancer subject
Technical Field
The invention belongs to the field of natural language processing. It relates to a biological event relation extraction method that addresses class imbalance, and further to a method that applies this extraction method to construct a biological event relation knowledge base, in particular a large-scale knowledge base comprising a biological event set, a biological event relation triplet set, a biological entity set and a biological entity relation set.
Background
A biological event relation knowledge base stores complete biological events and the specific relations among them. A biological event consists of trigger words, elements, element roles, and so on. A trigger word is a text mention that clearly indicates the occurrence of an event, usually a verb or a noun; an element is a participant or attribute of the event, usually a biomedical entity or the trigger word of another event; an element role is the semantic relation between an element and the event (G. Frisoni, G. Moro and A. Carbonaro, "A Survey on Event Extraction for Natural Language Understanding: Riding the Biomedical Literature Wave," in IEEE Access, vol. 9, pp. 160721-160757, 2021, doi: 10.1109/ACCESS.2021.3130956.). Biological event relation extraction is an important means of obtaining event relation knowledge from biological text, and is also a key technology for constructing a large-scale biological event relation knowledge base.
Class imbalance is a common problem in biological event relation extraction. Relation extraction in the biomedical field can be divided into biomedical entity relation triplet extraction and biological event relation extraction. A biomedical entity relation triplet describes a relation between two entities and is generally written as (entity 1, relation, entity 2), where entity 1 and entity 2 are biomedical entities such as genes, proteins, drugs and diseases, and the relation expresses the association between them, such as "treatment", "inhibition" or "promotion". This triplet representation can be used to construct biomedical knowledge graphs and to support tasks such as drug discovery and disease diagnosis. Events carry richer and deeper semantic information than entities, so biological event relation extraction is more complex and difficult than entity relation extraction. A biological event relation is a relation between two events (or processes) in the biological field, generally written as (event 1, relation, event 2). This representation can describe interactions, effects and dependencies between different biological events, where an event may be a biological process, a molecular interaction, cell signaling, and so on, and the relation expresses the connection between event 1 and event 2, for example "regulate", "activate" or "inhibit". Constructing biological event relation triplets helps researchers make better use of biological event relations in their work.
In the biomedical field, detecting relations between biological events in text has long been a topic of interest. Early approaches were mainly based on rule matching, neural networks, machine learning algorithms, and combinations of these. In recent years, pre-trained language models have achieved state-of-the-art performance on many natural language processing tasks, and BioBERT, a pre-trained language model for the biomedical domain, is widely used in biological event relation extraction. However, existing biological event relation extraction tasks generally suffer from class imbalance, which makes it difficult to train a robust extraction model. To alleviate this limitation, previous work (A. Akkasi and M.-F. Moens, "Causal relationship extraction from biomedical text using deep neural models: A comprehensive survey," Journal of Biomedical Informatics, vol. 119, p. 103820, 2021.) proposed over-sampling to augment the training data and supplement the subclass (minority-class) samples in the training set. This approach does not adequately remove the negative impact of imbalanced data on relation extraction, because it ignores a severe limitation of conventional augmentation when the dataset is small: simply repeating existing data can cause overfitting. Conventional text data augmentation methods can be broadly divided into two groups.
The simplest approach is to make local modifications to existing samples, most typically EDA (Easy Data Augmentation) (J. Wei and K. Zou, "EDA: Easy data augmentation techniques for boosting performance on text classification tasks," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 6382-6388.). However, simple adjustments to sentences do not solve the word ambiguity problem, and excessive modification may change the semantics of a sentence. The other approach is to generate entire sentences, for example with a variational autoencoder (Chadebec C, Allassonnière S. Data augmentation with variational autoencoders and manifold sampling [C]// Deep Generative Models, and Data Augmentation, Labelling, and Imperfections: First Workshop, DGM4MICCAI 2021, and First Workshop, DALI 2021, Held in Conjunction with MICCAI 2021, Strasbourg, France, October 1, 2021, Proceedings 1. Springer International Publishing, 2021: 184-192.), with back-translation (Sugiyama A, Yoshinaga N. Data augmentation using back-translation for context-aware neural machine translation [C]// Proceedings of the Fourth Workshop on Discourse in Machine Translation (DiscoMT 2019). 2019: 35-44.), or by rewriting the original sentence using a dependency tree or hand-set rules (Xu J, Cui J, Li J, et al. Entity Aware Syntax Tree Based Data Augmentation for Natural Language Understanding [J]. arXiv preprint arXiv:2209.02267, 2022.). By modifying the whole sentence, the expanded sample can preserve the semantics of the original sentence. However, these methods are hard to control, offer limited diversity, and typically require a large amount of training data. In recent years, with the development of deep learning and pre-trained models, the pre-train-then-fine-tune paradigm has gradually been applied to text data augmentation.
Garg and Ramakrishnan (Garg S, Ramakrishnan G. BAE: BERT-based adversarial examples for text classification [J]. arXiv preprint arXiv:2004.01970, 2020.) mask some of the words in the original sentence and then have a masked language model predict the masked positions to generate a new sentence. Other work (Ding B, Liu L, Bing L, et al. DAGA: Data augmentation with a generation approach for low-resource tagging tasks [J]. arXiv preprint arXiv:2011.01549, 2020.) (Bayer M, Kaufhold M A, Buchhold B, et al. Data augmentation in natural language processing: a novel text generation approach for long and short text classifiers [J]. International Journal of Machine Learning and Cybernetics, 2023, 14(1): 135-150.) uses generative models to produce new sentences. With the rapid development of generative models in recent years, a large body of research has shown that these models can generate high-quality text even when fine-tuned on a small training set. At present, however, no work on biological event relation extraction uses a generative model to augment data in order to reduce the impact of class imbalance.
Many ontologies and databases exist in the biomedical field, such as the Patheway and UMLS knowledge base systems, but existing knowledge bases only connect event trigger words semantically and do not consider the interactions among events from the viewpoint of complete events. Because biological events and their relations are complex, an event relation knowledge base for the biomedical field is still lacking. Existing biomedical knowledge bases are mostly limited to interactions within a single type of biological event, such as interactions between molecules (e.g., INSD, EMBL, GenBank) or between cells (e.g., CellTalkDB, CellPhoneDB, MACC), and lack the interactions between these biological events. In the cancer-related field, existing databases mainly focus on molecular mutation patterns related to the occurrence and development of cancer and on the genomic, transcriptomic and epigenetic information of tumors and their subtypes (e.g., TCGA, ICGC); the association of genes with other biological events has not been studied in depth.
The above analysis shows that research on biological event relation extraction has produced rich results, but event relation extraction under class imbalance needs further study, and knowledge base construction for cancer topics remains to be improved. This work, in cooperation with domain experts, extracts the relations among biological events at different levels, including molecules, cells, tissues and organs, from the biomedical literature, and constructs a more accurate and more detailed large-scale cancer-related biological event relation knowledge base.
Summary of the invention
To address the problem of class imbalance in the biological event relation extraction task, according to some embodiments of the present application, a biological event relation extraction method based on prompt learning and a prototype network comprises:
S10, for a class-imbalanced dataset, training a data generation model based on prompt learning to generate subclass sample data;
S20, screening the subclass sample data generated by the data generation model with a classification model based on a prototype network, and adding the retained subclass sample data to the class-imbalanced dataset to produce an enhanced dataset;
S30, training an event relation extraction model on the enhanced dataset, and classifying biological event relations with the trained event relation extraction model.
According to the biological event relation extraction method based on prompt learning and a prototype network in some embodiments of the present application, step S10 specifically comprises:
S101, constructing a balanced subset from the class-imbalanced dataset;
S102, concatenating the double-headed prompt templates to the two ends of each data sample of the balanced subset;
S103, training the data generation model on the data samples concatenated with the double-headed prompt templates, so that, for a given event relation, the data generation model generates the corresponding head event, tail event and a text containing both events, thereby obtaining the subclass sample data.
According to the biological event relation extraction method based on prompt learning and a prototype network in some embodiments of the present application, in step S101, constructing a balanced subset from the class-imbalanced dataset comprises:
counting, for each relation category $R_i$ in the class-imbalanced dataset $D$, its number of samples $N_i$ in the overall sample;
randomly sampling the samples of each relation category to obtain the balanced subset $D_b$, where the sampled data amount $N_{undersampling}$ is:
$$N'_i = N_{undersampling} = \min\{N_1, N_2, \ldots, N_n\}$$
where $n$ is the number of categories in the dataset and $N'_i$ is the number of samples of the $i$-th category in the sampled balanced subset $D_b$.
According to the biological event relation extraction method based on prompt learning and a prototype network in some embodiments of the present application, in step S102, the double-headed prompt templates $prompt_1$ and $prompt_2$ are given by:
$prompt_1$ = "Relation: {$R_i$}. Head event: {$E_1^i$}. Tail event: {$E_2^i$}."
$prompt_2$ = "The relation between E1 {$E_1^i$} and E2 {$E_2^i$} is {$R_i$}."
where $E_1^i$ and $E_2^i$ denote the head event and the tail event of the $i$-th data sample, respectively, and $R_i$ denotes the event relation of the $i$-th data sample;
for each data sample $X_i$, the final input text $I_i$ is formed by concatenating the two prompts to the two ends of the sample:
$$I_i = prompt_1 \oplus X_i \oplus prompt_2$$
where $\oplus$ denotes text concatenation.
According to the biological event relation extraction method based on prompt learning and a prototype network in some embodiments of the present application, in step S103, the objective of training the data generation model is to minimize the following loss function:
$$\mathcal{L} = -\sum_{i=1}^{num} \log P(t_i \mid t_{i-1}, t_{i-2}, \ldots, t_1)$$
where $t_i$ denotes the $i$-th token of the input text $I$ and $num$ denotes the length of the token list of that text, and $P(t_i \mid t_{i-1}, t_{i-2}, \ldots, t_1)$ is given by:
$$P(t_i \mid t_{i-1}, t_{i-2}, \ldots, t_1) = \frac{\exp(o_i / tp)}{\sum_{j=1}^{|V|} \exp(o_j / tp)}$$
where $o_i$ denotes the output of the last (fully connected) layer for token $t_i$, $o_j$ denotes the output of that layer for the $j$-th token of the vocabulary, $|V|$ denotes the vocabulary size, and $tp$ is an adjustable temperature hyperparameter;
the trained data generation model $M_{generator}$ generates the subclass sample data $D_g$ using the double-headed prompt consistency rule, which is expressed as:
$$\text{Rule}: \; prompt_1[E_1; E_2; R] = prompt_2[E_1; E_2; R]$$
where $prompt_i[E_1; E_2; R]$ denotes the head event, tail event and event relation information contained in the $i$-th prompt of a generated sample.
According to the biological event relation extraction method based on prompt learning and a prototype network in some embodiments of the present application, step S20 specifically comprises:
S201, dividing the balanced subset into a support set and a query set in a given proportion;
S202, computing a prototype representation for each category;
S203, computing the distance between the vector representation of each sample in the query set and the prototype representation of each category in the support set;
S204, computing the probability that the current sample in the query set belongs to each category;
S205, computing, from these probabilities, the cross-entropy loss of classifying the current sample, and optimizing the classification model based on the prototype network;
S206, after each complete round of training, using the classification model based on the prototype network to predict labels for the generated subclass sample data, and storing the prediction results;
S207, re-dividing the support set and the query set and repeating the above steps for the next round of training and prediction, with a threshold k set so that the training and prediction process is repeated k times;
S208, screening the subclass sample data by retaining the samples whose generated label matches the predicted label in at least k/2 of the k rounds, and adding the retained samples to the training set of the dataset to obtain the enhanced dataset.
According to the biological event relation extraction method based on prompt learning and a prototype network in some embodiments of the present application:
in step S201, the balanced subset $D_b$ is divided in a 1:1 ratio into a support set $S_d$ and a query set $Q_d$;
in step S202, for each category $C_i \in C = \{C_1, C_2, \ldots, C_k\}$, $i = 1, 2, \ldots, k$, where $k$ denotes the number of categories, the prototype representation $ClassPrototype(C_i)$ is given by:
$$ClassPrototype(C_i) = \frac{1}{|S_d^{C_i}|} \sum_{x \in S_d^{C_i}} H(x)$$
where $H(x)$ is the vector representation of sample $x$ obtained by BioBERT encoding, and $S_d^{C_i}$ denotes the set of samples in the support set $S_d$ whose category is $C_i$;
in step S203, the distance between the vector representation of each sample in the query set $Q_d$ and the prototype representation of each category in the support set $S_d$ is computed;
in step S204, the probability that the current sample $x$ in the query set $Q_d$ belongs to each category is given by:
$$P(y = C_i \mid x) = \frac{\exp\big(COS(H(x), ClassPrototype(C_i))\big)}{\sum_{j=1}^{k} \exp\big(COS(H(x), ClassPrototype(C_j))\big)}$$
where $COS(\cdot)$ computes the cosine similarity of two given vectors, which is used to measure the distance between them, and $C_j$ denotes the $j$-th category.
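Steps S202 to S205 can be illustrated with a minimal sketch. It assumes that the prototype of a category is the mean of its support vectors and that the class probabilities are a softmax over cosine similarities, as is usual for prototype networks; the short hand-written vectors stand in for BioBERT encodings $H(x)$, and the labels "causes" and "none" are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def cosine(u, v):
    """Cosine similarity COS(u, v) of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def class_prototypes(support):
    """support maps each category to the encoded vectors of its support samples."""
    return {label: mean_vector(vecs) for label, vecs in support.items()}

def class_probabilities(query_vec, prototypes):
    """Softmax over the cosine similarities between a query vector and each prototype."""
    sims = {label: cosine(query_vec, proto) for label, proto in prototypes.items()}
    z = sum(math.exp(s) for s in sims.values())
    return {label: math.exp(s) / z for label, s in sims.items()}

def cross_entropy(probs, true_label):
    """Cross-entropy loss of one query sample given its true category."""
    return -math.log(probs[true_label])

# Toy 2-dimensional vectors standing in for BioBERT encodings (assumption).
support = {"causes": [[1.0, 0.0], [0.8, 0.2]], "none": [[0.0, 1.0], [0.2, 0.8]]}
protos = class_prototypes(support)
probs = class_probabilities([0.9, 0.1], protos)
loss = cross_entropy(probs, "causes")
```

In a real model the loss would be back-propagated through the encoder; here it is only computed to show how the probability feeds into step S205.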
According to the biological event relation extraction method based on prompt learning and a prototype network in some embodiments of the present application, step S30 specifically comprises:
S301, encoding the enhanced dataset with the pre-trained language model BioBERT, and extracting from the encoding the hidden-layer features of the head event, the tail event and the whole sentence;
S302, passing the event features through a ReLU activation function and then concatenating them with the sentence features;
S303, normalizing the concatenated features and feeding the normalized features into a linear layer;
S304, classifying biological event relations with the trained event relation extraction model.
According to the biological event relation extraction method based on prompt learning and a prototype network in some embodiments of the present application, step S301 yields the hidden-layer features of the head event and the tail event and the hidden-layer feature of the whole sentence, the latter denoted $h_s$.
Steps S302 to S304 are expressed by the following formula:
where $W_a$ and $b_a$ are trainable parameters and $h_f$ is the vector that is finally fed into the linear classification layer.
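The fusion described in steps S302 to S304 can be sketched as follows. This is an illustration under stated assumptions rather than the formula of the application: the names `h_head` and `h_tail`, the concatenation order, the use of layer normalization as the normalization step, and the reading of the weight and bias as the parameters of the final linear layer are all assumptions.

```python
import math

def relu(v):
    """ReLU activation applied element-wise."""
    return [max(0.0, x) for x in v]

def layer_norm(v, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def linear(v, weight, bias):
    """Linear layer: one output per row of `weight`."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weight, bias)]

def relation_logits(h_head, h_tail, h_s, weight, bias):
    """Activate the event features, concatenate with the sentence feature,
    normalize, and project to one logit per relation category."""
    h_f = layer_norm(relu(h_head) + relu(h_tail) + h_s)
    return linear(h_f, weight, bias), h_f

# Toy 2-dimensional features and a 2-category linear layer (all values hypothetical).
h_head, h_tail, h_s = [1.0, -2.0], [-0.5, 0.5], [0.3, 0.7]
weight = [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]]
bias = [0.0, 0.0]
logits, h_f = relation_logits(h_head, h_tail, h_s, weight, bias)
predicted = max(range(len(logits)), key=lambda c: logits[c])
```

The predicted relation category is the index of the largest logit; a softmax over the logits would give class probabilities if they are needed.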
According to the method for constructing a large-scale cancer-related biological event relation knowledge base in some embodiments of the present application, the event relation information required by the cancer-themed biological event relation knowledge base is obtained with the above biological event relation extraction method based on prompt learning and a prototype network, and is stored in the knowledge base.
Beneficial effects: the biological event relation extraction method based on prompt learning and a prototype network mitigates the degradation of biological event relation extraction performance caused by class imbalance, a problem that existing research has not adequately solved. A cancer-related biological event relation knowledge base is constructed with this extraction method, and the method is applicable to biological event relation extraction in the biomedical field.
Drawings
FIG. 1 is a diagram of a model of extraction of biological event relationships based on prompt enhancement.
FIG. 2 is a database E-R diagram.
Detailed Description
The invention first provides a biological event relation extraction method: given a passage of biomedical text and an event pair annotated in the text, the model outputs the relation of the event pair. A specific embodiment of the model is as follows:
1. Data example
Each item in the dataset includes "original text", "head event information" and "tail event information". The "original text" is the original biomedical text containing the event pair and provides the context of the two events; "head event information" and "tail event information" include the trigger word of the event, the event elements, and so on. As shown in Table 2, the Head Event and Tail Event are formatted representations of "the expression of TKTL1" and "the expression of LDH5", respectively, where the event category of "expression" is Gene expression and the entity category of TKTL1 and LDH5 is Gene_or_gene_product.
Table 2. Model input data sample
2. Biological event relationship classification based on prompt enhancement
In the training stage, the best model obtained after prompt-based enhancement is saved; the saved model is then loaded to classify biological event relations. The text is first tokenized and the [CLS] and [SEP] tags are added. After BioBERT encoding, the hidden-layer vectors of the head and tail events are extracted according to the event information. The vector corresponding to [CLS] is then concatenated with the head and tail event vectors and classified, that is, mapped into the relation category space to obtain the final event relation extraction result, as shown in Table 3:
Table 3. Biological event relation extraction results
Biological event relation extraction is a prerequisite for constructing a biological event relation triplet set. After the event relation extraction model has been applied to a large-scale corpus containing biological events, the resulting event relation triplets can be used to construct a large-scale biological event relation knowledge base for a given topic (such as a cancer-related topic).
The biological event relation extraction method based on prompt learning and a prototype network therefore mitigates the degradation of extraction performance caused by class imbalance, which existing research has not adequately solved, and the extraction method is applied to construct a cancer-related biological event relation knowledge base.
The invention thus consists of two major parts:
1. Biological event relation extraction based on prompt learning and prototype network filtering.
2. Construction of a large-scale cancer-related biological event relation knowledge base.
The technical scheme adopted by the invention is as follows:
The literature-based method for constructing a cancer-related biological event relation knowledge base comprises the following steps:
(I) Acquisition of model training data
The model of the invention is trained on a public biological event relation extraction dataset (Hahn-Powell G, Bell D, Valenzuela-Escárcega M A, et al. This before that: Causal precedence in the biomedical domain [J]. arXiv preprint arXiv:1606.08089, 2016.). The dataset contains 827 training samples covering 5 event relations; the largest category has 442 samples and the smallest has 44, which is a typical case of class imbalance.
(II) biological event relation extraction based on prompt learning and prototype network
To recognize event relations in large-scale biological event data, the invention provides a biological event relation extraction model based on prompt learning and a prototype network. The model aims to reduce the influence of class imbalance on event relation classification. First, a data generation model based on prompt learning generates data for the subclass samples in the dataset. Second, a classification model based on a prototype network filters the generated data to obtain high-quality enhancement data, which supplements the subclass samples in the class-imbalanced dataset and yields an enhanced dataset. Finally, the enhanced dataset is encoded with the pre-trained language model BioBERT and classified by a neural network model that fuses event and text features, thereby classifying biological event relations.
The specific flow of the model is as follows:
(1) Data generation model based on prompt learning
The data generation model applies prompt learning to the generative model BioGPT and trains it to generate data according to the designed prompt information. Specifically, a balanced subset is first constructed for training the generative model; the double-headed prompt templates are then concatenated to the two ends of each data sample of the balanced subset, and the data generation model is trained; finally, the trained data generation model generates data samples for a given category label, and the generated samples are filtered by a consistency rule to obtain the generated dataset $D_g$ (the subclass sample dataset). The specific steps are as follows:
(1) Construction of the balanced subset $D_b$: the existing class-imbalanced dataset $D$ is first undersampled. For each relation category $R_i$ in the dataset $D$, its number of samples $N_i$ in the overall sample is counted; the samples of each relation category are then randomly sampled to obtain the balanced subset $D_b$, where the sampled data amount $N_{undersampling}$ is:
$$N'_i = N_{undersampling} = \min\{N_1, N_2, \ldots, N_n\} \quad (2.1)$$
where $n$ is the number of categories in the dataset and $N'_i$ is the number of samples of the $i$-th category in the sampled subset $D_b$. This operation balances the training data of the data generation model, so that class imbalance does not bias the model during training and generation and the generated samples conform better to the given label.
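The undersampling of formula (2.1) can be sketched in a few lines. The data layout (a list of `(sample, relation_label)` pairs), the function name and the fixed random seed are illustrative assumptions; the class sizes 442 and 44 echo the largest and smallest categories of the public dataset quoted earlier, but the toy data itself is invented.

```python
import random
from collections import defaultdict

def build_balanced_subset(dataset, seed=0):
    """Undersample every relation category to the size of the smallest one.

    `dataset` is a list of (sample, relation_label) pairs (assumed layout).
    Returns the balanced subset D_b and N_undersampling.
    """
    by_class = defaultdict(list)
    for sample, label in dataset:
        by_class[label].append((sample, label))

    # N_undersampling = min{N_1, ..., N_n}
    n_undersampling = min(len(items) for items in by_class.values())

    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_class):
        # Draw N_undersampling samples without replacement from each category.
        balanced.extend(rng.sample(by_class[label], n_undersampling))
    return balanced, n_undersampling

# Toy two-category dataset with a roughly 10:1 imbalance.
toy = [(f"majority text {i}", "A") for i in range(442)]
toy += [(f"minority text {i}", "B") for i in range(44)]
subset, n_min = build_balanced_subset(toy)
```

With these sizes every category contributes 44 samples, so the balanced subset holds 88 samples in total.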
(2) Data generation based on double-headed prompts: for the constructed balanced dataset $D_b$, the double-headed prompt templates are concatenated to the two ends of each data sample in $D_b$, and the concatenated samples are used for training, guiding the generative model BioGPT to learn structural knowledge from the given prompts; that is, the model is trained to generate, for a given event relation, the corresponding head and tail events and a text containing both events. The double-headed prompt templates $prompt_1$ and $prompt_2$ are designed as follows:
$prompt_1$ = "Relation: {$R_i$}. Head event: {$E_1^i$}. Tail event: {$E_2^i$}." (2.2)
$prompt_2$ = "The relation between E1 {$E_1^i$} and E2 {$E_2^i$} is {$R_i$}." (2.3)
where $E_1^i$ and $E_2^i$ denote the head event and the tail event of the $i$-th data sample, respectively, and $R_i$ denotes its event relation. For each data sample $X_i$, the final input text $I_i$ is formed as follows:
$$I_i = prompt_1 \oplus X_i \oplus prompt_2 \quad (2.4)$$
where $\oplus$ denotes text concatenation.
The double-headed prompts are concatenated to the two sides of the data sample for the subsequent training of the generative model. In the training phase, the generative pre-trained language model BioGPT is fine-tuned. BioGPT is a causal language model that generates sequences autoregressively, and the goal of training is to minimize the following loss function:
$$\mathcal{L} = -\sum_{i=1}^{num} \log P(t_i \mid t_{i-1}, t_{i-2}, \ldots, t_1) \quad (2.5)$$
where $t_i$ denotes the $i$-th token of the input text $I$ and $num$ denotes the length of the token list of that text, i.e. the number of tokens it contains. $P(t_i \mid t_{i-1}, t_{i-2}, \ldots, t_1)$ is computed as follows:
$$P(t_i \mid t_{i-1}, t_{i-2}, \ldots, t_1) = \frac{\exp(o_i / tp)}{\sum_{j=1}^{|V|} \exp(o_j / tp)} \quad (2.6)$$
where $o_i$ denotes the output of the last (fully connected) layer for token $t_i$, $o_j$ denotes the output of that layer for the $j$-th token of the vocabulary, $|V|$ is the vocabulary size, and $tp$ is an adjustable temperature hyperparameter used to keep the model from falling into a local optimum.
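The temperature-scaled softmax and the autoregressive loss can be illustrated with a small, framework-free sketch. It assumes the loss is the summed negative log-likelihood over the token sequence (whether the original formula sums or averages is not recoverable from the text), and the toy logits over a three-token vocabulary are invented for illustration.

```python
import math

def temperature_softmax(logits, tp=1.0):
    """Softmax over vocabulary logits divided by the temperature tp."""
    scaled = [o / tp for o in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def causal_lm_loss(step_logits, target_ids, tp=1.0):
    """Summed negative log-likelihood of the target tokens.

    step_logits[i] holds the |V| logits predicted at position i from the
    preceding tokens; target_ids[i] is the vocabulary index of the true token t_i.
    """
    loss = 0.0
    for logits, target in zip(step_logits, target_ids):
        probs = temperature_softmax(logits, tp)
        loss -= math.log(probs[target])
    return loss

# Two positions, vocabulary of size 3; the true token has the largest logit each time.
logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
loss_sharp = causal_lm_loss(logits, [0, 1], tp=0.5)
loss_flat = causal_lm_loss(logits, [0, 1], tp=2.0)
```

A temperature below 1 sharpens the distribution and a temperature above 1 flattens it, so when the true token already has the largest logit the loss is smaller at `tp=0.5` than at `tp=2.0`.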
After training, the generative model $M_{generator}$ is saved; $M_{generator}$ is the final model used to generate data. Specifically, the invention takes a data label, i.e. an event relation, as input and uses the trained model to generate text in the format of formula (2.4). To ensure both the quantity and the usability of the generated data, the invention sets a usability rule: it checks whether the event trigger words in the generated prompt templates also appear in the generated sentence text, so that the generated data satisfies basic semantic common sense, namely that the event trigger words contained in the generated double-headed prompts all occur in the generated sentence text. In addition, a quantity threshold t = 500 is set to control the amount of usable data generated for each category. For the data that passes the usability rule, the invention further applies the double-headed prompt consistency rule to filter out higher-quality data, which is the final generated data $D_g$ of this module. The double-headed prompt consistency rule can be described by the following formula:
Rule: prompt_1[E_1; E_2; R] === prompt_2[E_1; E_2; R]    (2.7)
where prompt_i[E_1; E_2; R] refers to the head event, tail event and event relation information in the i-th sample.
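The usability rule and the consistency rule described above can be sketched as a single check on one generated sample. This is only an illustration: the regular expressions assume the prompt templates are written with single spaces between fields, and the function names and example relation label are hypothetical.

```python
import re

# Assumed surface forms of the two prompt templates (spacing is an assumption).
P1 = re.compile(
    r"Relation:\s*(?P<R>.+?)\.\s*Head event:\s*(?P<E1>.+?)\.\s*"
    r"Tail event:\s*(?P<E2>.+?)\.\s*$"
)
P2 = re.compile(
    r"The relation between E1\s+(?P<E1>.+?)\s+and E2\s+(?P<E2>.+?)"
    r"\s+is\s+(?P<R>.+?)\.\s*$"
)

def parse_prompt(pattern, prompt):
    """Return (E1, E2, R) extracted from a prompt, or None if it is malformed."""
    m = pattern.match(prompt.strip())
    return (m["E1"], m["E2"], m["R"]) if m else None

def is_usable(prompt1, sentence, prompt2):
    """Apply the consistency rule, then the trigger-word usability rule."""
    first = parse_prompt(P1, prompt1)
    second = parse_prompt(P2, prompt2)
    if first is None or second is None:
        return False
    if first != second:  # prompt_1[E1;E2;R] must equal prompt_2[E1;E2;R]
        return False
    e1, e2, _ = first
    text = sentence.lower()
    return e1.lower() in text and e2.lower() in text  # triggers occur in the text
```

Samples for which `is_usable` returns `False` would be discarded before the prototype-network filter is applied.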
(2) Data filtering module based on prototype network
To obtain higher-quality enhancement data, the invention constructs and trains a prototype-network-based data filter (a classification model). First, the balanced subset D_b is divided in a 1:1 ratio into a support set S_d and a query set Q_d. Then, for each category C_i ∈ C = {C_1, C_2, ..., C_k}, its prototype representation ClassPrototype(C_i) is computed:
$$ClassPrototype(C_i) = \frac{1}{|S_d^{C_i}|} \sum_{x \in S_d^{C_i}} H(x)$$
where H(x) is the vector representation of sample x encoded by BioBERT, and $S_d^{C_i}$ denotes the set of samples of category C_i in the support set. Next, the probability that each sample in Q_d belongs to each category is computed from the distance between the sample's vector representation and the prototype representations of the categories in S_d. Taking any query sample x as an example:
$$P(y = C_i \mid x) = \frac{\exp(COS(H(x), ClassPrototype(C_i)))}{\sum_{j=1}^{k} \exp(COS(H(x), ClassPrototype(C_j)))}$$
where COS(·) computes the cosine similarity of two given vectors, which serves as the measure of the distance between them; the other samples in the query set are handled in the same way. The cross-entropy loss of classifying the sample is then computed from this probability to optimize the whole model. After one complete round of training, the model predicts labels for the generated data D_g and the predictions are stored. The support set and the query set are then re-divided, and the above steps are repeated for a second round of training and prediction. A threshold k is set so that the training-prediction process is repeated k times. Finally, the generated data is screened: the samples whose generated label matches the predicted label in at least k/2 of the rounds are retained, and this set is the final enhanced data D_a filtered by the prototype network.
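The prototype computation, the similarity-based class probabilities and the k/2 agreement filter can be sketched as follows. Plain Python lists stand in for BioBERT vectors, training is omitted, and all function names are hypothetical; the softmax over cosine similarities is an assumption about how the similarities are turned into probabilities.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equally sized vectors (the class prototype)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def class_prototypes(support):
    """support maps each category label to the encoded samples of that category."""
    return {label: mean_vector(vecs) for label, vecs in support.items()}

def class_probabilities(x, prototypes):
    """Softmax over the cosine similarity between x and every class prototype."""
    sims = {label: cosine(x, p) for label, p in prototypes.items()}
    z = sum(math.exp(s) for s in sims.values())
    return {label: math.exp(s) / z for label, s in sims.items()}

def keep_by_vote(generated_labels, predictions_per_round):
    """Indices of generated samples whose label was predicted in >= k/2 rounds."""
    k = len(predictions_per_round)
    kept = []
    for idx, label in enumerate(generated_labels):
        agree = sum(1 for preds in predictions_per_round if preds[idx] == label)
        if agree >= k / 2:
            kept.append(idx)
    return kept
```

Here `predictions_per_round[r][idx]` is the label predicted for generated sample `idx` in round `r`, so the length of the outer list plays the role of the threshold k.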
(3) Event relation classification module for enhanced samples, fusing event and text features
This module implements the final biological event relation extraction function: the enhanced data D_a obtained in the previous step is added to the original training set, and the event relation extraction model is then trained. First, the data is encoded with the pre-trained language model BioBERT; the hidden-layer features of the head event, the tail event and the whole sentence are then extracted from the encoding, the sentence feature being denoted h_s. The event features are passed through the ReLU activation function and concatenated with the sentence feature, the concatenated feature is normalized, and the normalized feature is finally fed into a linear layer for classification. The specific calculation process of the module is as follows:
where W_a and b_a are trainable parameters, and h_f is the vector that finally enters the linear classification layer.
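The feature-fusion step can be sketched in plain Python as below. This is only an illustration under stated assumptions: the text does not fix where W_a and b_a are applied, so the sketch assumes they form a linear projection of the concatenated head and tail event features before the ReLU, and it assumes the normalization is a layer normalization without learned scale. The function names and the classifier parameters `W_c` and `b_c` are hypothetical.

```python
import math

def linear(x, weight, bias):
    """Affine map: one output per row of weight."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def fuse_features(h_e1, h_e2, h_s, W_a, b_a):
    """Return h_f: activated event features concatenated with h_s, then normalized."""
    event = relu(linear(h_e1 + h_e2, W_a, b_a))  # assumed placement of W_a, b_a
    return layer_norm(event + h_s)

def classify(h_f, W_c, b_c):
    """Index of the highest-scoring relation class from the linear layer."""
    scores = linear(h_f, W_c, b_c)
    return max(range(len(scores)), key=scores.__getitem__)
```

In practice the vectors would come from BioBERT hidden states and the parameters would be learned jointly with the encoder; the sketch only shows the order of operations.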
The biological event relation extraction method based on prompt-learning enhancement and prototype network filtering mitigates the influence of class imbalance on the model: by introducing a data enhancement method that provides data diversity, it addresses the biased predictions that class imbalance causes in classification models. The invention achieves a macro-F1 of 43.19% on a public biological event relation extraction dataset (Hahn-Powell G, Bell D, Valenzuela-Escárcega M A, et al. This before that: Causal precedence in the biomedical domain[J]. arXiv preprint arXiv:1606.08089, 2016), which is the best result on this task. Furthermore, experiments on this public dataset show that a BioBERT model trained without the data enhancement of the invention reaches a macro-F1 of only 37.93%, while a BioBERT model trained with the generated subclass sample data but without the filtering operation (screening by the prototype-network-based classification model) reaches 42.57%. These results indicate the effectiveness of the prompt-based data enhancement method and the necessity of the prototype-network-based filtering method.
A model structure diagram of the event relation extraction method of the invention is shown in FIG. 1.
(III) Construction of a large-scale cancer-related biological event relation knowledge base
The event relation information required to construct the cancer-themed biological event relation knowledge base is obtained by the above method and stored in the knowledge base. The constructed biological event relation knowledge base contains the following:
(1) Biological event set
Event trigger words, elements, element roles and event types constitute the biological event set E = {E_i | i ∈ 1,2,…,n}, where E_i = (tri_i, (arg_1, role_1), (arg_2, role_2), …, (arg_k, role_k), E_type_i) is a specific biological event. Here tri_i is the trigger word of biological event E_i, (arg_k, role_k) denotes the element arg_k of biological event E_i and its corresponding element role role_k, and E_type_i denotes the event type of biological event E_i.
(2) Biological event relationship triplet sets
Biological event relations are extracted with the biological event relation extraction model proposed above, and the extracted event relations form the triplet set ER = {<E_i, R_ij, E_j> | E_i, E_j ∈ E, R_ij ∈ R}, where E = {E_i | i ∈ 1,2,…,n} is the set of biological events and R = {R_1, R_2, …, R_m} is the set of event relation types.
(3) Biological entity set
All biological entities involved in the event relations constitute the biological entity set e = {e_1, e_2, …, e_T}.
(4) Biological entity relationship set
Biological entity relations are extracted with an existing biological entity relation extraction model (Li L, Lian R, Lu H, et al. Document-level biomedical relation extraction based on multi-dimensional fusion information and multi-granularity logical reasoning[C]//Proceedings of the 29th International Conference on Computational Linguistics. 2022: 2098-2107), yielding the relation set between biological entity pairs EnRn = {<En_i, Rn_ij, En_j> | En_i, En_j ∈ En, Rn_ij ∈ Rn}, where En = {En_i | i ∈ 1,2,...,n} is the set of biological entities and Rn = {Rn_1, Rn_2, ..., Rn_n} is the set of entity relation types.
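The four sets listed above can be pictured with a small in-memory structure. This is a minimal sketch, not the patent's storage schema: the class and field names are hypothetical, and treating event elements as the source of the entity set is an assumption made for the example.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class BioEvent:
    """E_i = (tri_i, (arg_1, role_1), ..., (arg_k, role_k), E_type_i)."""
    trigger: str
    arguments: tuple  # tuple of (element, role) pairs
    event_type: str

@dataclass(frozen=True)
class EventRelation:
    """Triplet <E_i, R_ij, E_j>."""
    head: BioEvent
    relation: str
    tail: BioEvent

@dataclass
class KnowledgeBase:
    events: set = field(default_factory=set)
    event_relations: set = field(default_factory=set)
    entities: set = field(default_factory=set)
    entity_relations: set = field(default_factory=set)

    def add_event_relation(self, head, relation, tail):
        """Store the triplet together with its events and their elements."""
        self.events.update((head, tail))
        for event in (head, tail):
            self.entities.update(arg for arg, _role in event.arguments)
        self.event_relations.add(EventRelation(head, relation, tail))

    def add_entity_relation(self, entity_a, relation, entity_b):
        """Store the triplet <En_i, Rn_ij, En_j>."""
        self.entities.update((entity_a, entity_b))
        self.entity_relations.add((entity_a, relation, entity_b))
```

Because the records are frozen dataclasses stored in sets, inserting the same event or relation twice leaves a single copy, which matches the set-based definitions.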
The constructed cancer-related biological event relation knowledge base comprises a biological event table, a biological event relation table, a biological entity table and the like, and the specific tables are as follows:
Table 1. Knowledge base tables
While the invention has been described with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A biological event relation extraction method based on prompt learning and a prototype network, characterized by comprising the following steps:
S10, for a class-imbalanced dataset, training a data generation model based on prompt learning to generate subclass sample data;
S20, screening the subclass sample data generated by the data generation model with a classification model based on a prototype network, and adding the retained subclass sample data to the class-imbalanced dataset to produce an enhanced dataset;
S30, training an event relation extraction model with the enhanced dataset, the trained event relation extraction model then classifying biological event relations.
2. The biological event relation extraction method based on prompt learning and a prototype network according to claim 1, characterized in that step S10 specifically comprises:
S101, constructing a balanced subset from the class-imbalanced dataset;
S102, concatenating the double-headed prompt templates to the two ends of each data sample of the balanced subset;
S103, training the data generation model with the data samples concatenated with the double-headed prompt templates, so that, given an event relation, the data generation model generates the corresponding head event, tail event and a text containing the events, thereby obtaining the subclass sample data.
3. The biological event relation extraction method based on prompt learning and a prototype network according to claim 2, characterized in that constructing the balanced subset from the class-imbalanced dataset in step S101 comprises:
counting the number N_i of samples of each relation category R_i in the class-imbalanced dataset D;
randomly sampling the samples of each relation category to obtain the balanced subset D_b, where the sampled data amount N_undersampling is:
N'_i = N_undersampling = min{N_1, N_2, ..., N_n}
where n is the number of categories in the dataset, and N'_i is the number of samples of the i-th category in the sampled balanced subset D_b.
4. The biological event relation extraction method based on prompt learning and a prototype network according to claim 3, characterized in that in step S102 the double-headed prompt templates prompt_1 and prompt_2 are given by:
prompt_1 = "Relation: {R_i}. Head event: {E_1^i}. Tail event: {E_2^i}."
prompt_2 = "The relation between E1 {E_1^i} and E2 {E_2^i} is {R_i}."
where E_1^i and E_2^i denote the head event and the tail event of the i-th data sample, respectively, and R_i denotes the event relation of the i-th data sample;
the final input text I_i of each data sample X_i is constructed as follows:
5. The biological event relation extraction method based on prompt learning and a prototype network according to claim 4, characterized in that in step S103 the training objective of the data generation model is to minimize the following loss function:
$$L = -\sum_{i=1}^{num_I} \log P(t_i \mid t_{i-1}, t_{i-2}, \ldots, t_1)$$
where $t_i$ denotes the $i$-th token of the input text $I$, $num_I$ denotes the length of the token list of that text, and $P(t_i \mid t_{i-1}, t_{i-2}, \ldots, t_1)$ is given by:
$$P(t_i \mid t_{i-1}, t_{i-2}, \ldots, t_1) = \frac{\exp(o_i / tp)}{\sum_{j=1}^{|V|} \exp(o_j / tp)}$$
where $o_i$ denotes the output for the $i$-th token in the last fully connected layer, $o_j$ denotes the output for the $j$-th token in the last fully connected layer, $|V|$ denotes the size of the vocabulary, and $tp$ is a tunable temperature hyperparameter;
the data generation model M_generator obtained by training generates data, from which the subclass sample data D_g is obtained using the double-headed prompt consistency rule, the rule being expressed by the following formula:
Rule: prompt_1[E_1; E_2; R] === prompt_2[E_1; E_2; R]
where prompt_i[E_1; E_2; R] denotes the head event, tail event and event relation information in the i-th sample.
6. The biological event relation extraction method based on prompt learning and a prototype network according to any one of claims 1 to 5, characterized in that step S20 specifically comprises:
S201, dividing the balanced subset into a support set and a query set in a given ratio;
S202, calculating a prototype representation for each category;
S203, calculating the distance between the vector representation of each sample in the query set and the prototype representation of each category in the support set;
S204, calculating the probability that the current sample in the query set belongs to each category;
S205, calculating the cross-entropy loss of classifying the current sample from the probability, and optimizing the classification model based on the prototype network;
S206, after one complete round of training, using the classification model based on the prototype network to predict labels for the generated subclass sample data, and storing the prediction results;
S207, re-dividing the support set and the query set and repeating the above steps for the next round of training and prediction, a threshold k being set so that the training and prediction process is repeated k times;
S208, screening the subclass sample data, retaining the data whose label matches the predicted label in at least k/2 of the rounds, and adding the retained data to the training set of the dataset to obtain the enhanced dataset.
7. The biological event relation extraction method based on prompt learning and a prototype network according to claim 6, characterized in that:
in step S201, the balanced subset D_b is divided in a 1:1 ratio into the support set S_d and the query set Q_d;
in step S202, for each category C_i ∈ C = {C_1, C_2, ..., C_k}, i = 1, 2, ..., k, where k denotes the number of categories, the prototype representation ClassPrototype(C_i) is given by:
$$ClassPrototype(C_i) = \frac{1}{|S_d^{C_i}|} \sum_{x \in S_d^{C_i}} H(x)$$
where H(x) is the vector representation of sample x encoded by BioBERT, and $S_d^{C_i}$ denotes the set of samples of category C_i in the support set S_d;
in step S203, the distance between the vector representation of each sample in the query set Q_d and the prototype representation of each category in the support set S_d is calculated;
in step S204, the probability that the current sample x of the query set Q_d belongs to each category is given by:
$$P(y = C_i \mid x) = \frac{\exp(COS(H(x), ClassPrototype(C_i)))}{\sum_{j=1}^{k} \exp(COS(H(x), ClassPrototype(C_j)))}$$
where COS(·) computes the cosine similarity of two given vectors, which serves as the measure of the distance between them, and C_j denotes the j-th category.
8. The biological event relation extraction method based on prompt learning and a prototype network according to claim 7, characterized in that step S30 specifically comprises:
S301, encoding the enhanced dataset with the pre-trained language model BioBERT, and extracting from the encoding the hidden-layer features of the head and tail events and of the whole sentence;
S302, passing the event features through the ReLU activation function and concatenating them with the sentence feature;
S303, normalizing the concatenated feature and feeding the normalized feature into a linear layer;
S304, classifying biological event relations with the trained event relation extraction model.
9. The biological event relation extraction method based on prompt learning and a prototype network according to claim 8, characterized in that in step S301 the hidden-layer features of the head event, the tail event and the whole sentence are extracted, the sentence feature being denoted h_s;
steps S302 to S304 are expressed by the following formula:
where W_a and b_a are trainable parameters, and h_f is the vector that finally enters the linear classification layer.
10. A method for constructing a large-scale cancer-related biological event relation knowledge base, characterized in that the event relation information required by the cancer-themed biological event relation knowledge base is obtained according to the method of any one of claims 1 to 9 and stored in the knowledge base.
CN202311563994.3A 2023-11-22 2023-11-22 Biological event relation extraction method and large-scale biological event relation knowledge base construction method of cancer subject Pending CN117436522A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311563994.3A CN117436522A (en) 2023-11-22 2023-11-22 Biological event relation extraction method and large-scale biological event relation knowledge base construction method of cancer subject

Publications (1)

Publication Number Publication Date
CN117436522A true CN117436522A (en) 2024-01-23

Family

ID=89551497

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311563994.3A Pending CN117436522A (en) 2023-11-22 2023-11-22 Biological event relation extraction method and large-scale biological event relation knowledge base construction method of cancer subject

Country Status (1)

Country Link
CN (1) CN117436522A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118097442A (en) * 2024-04-29 2024-05-28 大连理工大学 Method for recognizing generalization of few-sample remote sensing target through diversity prompt learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination