CN113360667A - Biomedical trigger word detection and named entity identification method based on multitask learning - Google Patents

Biomedical trigger word detection and named entity identification method based on multitask learning

Info

Publication number
CN113360667A
CN113360667A (application CN202110617440.1A)
Authority
CN
China
Prior art keywords
word
sequence
trigger
named entity
ner
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110617440.1A
Other languages
Chinese (zh)
Other versions
CN113360667B (en)
Inventor
苏延森
詹飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University filed Critical Anhui University
Priority to CN202110617440.1A priority Critical patent/CN113360667B/en
Publication of CN113360667A publication Critical patent/CN113360667A/en
Application granted granted Critical
Publication of CN113360667B publication Critical patent/CN113360667B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks

Abstract

The invention discloses a biomedical trigger word detection and named entity recognition method based on multi-task learning, which comprises the following steps: 1, preprocessing unstructured biomedical text with word-segmentation and sentence-segmentation techniques, and labeling the preprocessed text to generate a standard data set; 2, constructing a multi-task-learning neural network model for biomedical trigger word detection and named entity recognition; 3, training the neural network model and updating its parameters; and 4, predicting on unlabeled data with the trained optimal model so as to identify the trigger words and named entities it contains. The method performs trigger word detection and named entity recognition on biomedical text simultaneously, thereby effectively improving recognition accuracy and reducing the demand on computing resources.

Description

Biomedical trigger word detection and named entity identification method based on multitask learning
Technical Field
The invention relates to the field of biomedical text mining, in particular to a biomedical trigger word detection and named entity identification method based on multi-task learning.
Background
A named entity is a noun or noun phrase in text that carries particularly critical meaning. Named entity recognition can be divided into general-domain and domain-specific entity recognition. In the general domain, entities can be divided into organization names, person names, place names, and the like. In the biomedical domain, entities can be classified into cell entities, gene entities, protein entities, drug entities, disease entities, and the like. Compared with general-domain named entity recognition, named entity recognition in the biomedical domain is more difficult because of entity nesting, word ambiguity, and the like. Accurate identification of biomedical entities can facilitate the further development of information extraction and natural language processing techniques. In the biomedical field, named entity recognition can extract structured biomedical entity information from large volumes of unstructured documents, which promotes the construction of biomedical knowledge graphs and databases.
Popular named entity recognition methods fall mainly into rule-based methods, traditional machine-learning methods, and deep-learning methods. Rule-based approaches rely primarily on manually formulated rules, including domain-specific gazetteers and syntactic-lexical patterns, to identify entities in text, and they need no annotated data set. Traditional machine-learning approaches rely mainly on manually designed linguistic features, such as prefix and suffix features, lexical features, and syntactic features, to train a traditional machine-learning algorithm to recognize named entities. In recent years, owing to the advantage of deep neural networks in automatically extracting the internal features of data, many deep-learning-based named entity recognition methods have emerged. However, current methods can mostly perform only a single, independent entity recognition task and do not extract the semantic feature information in the text sufficiently, so their recognition performance is limited.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a biomedical trigger word detection and named entity recognition method based on multi-task learning, so that trigger word detection and named entity recognition can be performed on biomedical text simultaneously, effectively improving recognition accuracy and reducing the demand on computing resources.
In order to achieve the purpose, the invention adopts the technical scheme that:
the invention relates to a biomedical trigger word detection and named entity identification method based on multitask learning, which is characterized by comprising the following steps of:
step 1, preprocessing unstructured biomedical texts:
performing word segmentation and sentence segmentation on the unstructured biomedical text to obtain an unlabeled training data set consisting of n sentence sequences, denoted S = {S_1, S_2, ..., S_i, ..., S_n}; S_i denotes the i-th sentence sequence, and S_i = {w_i^1, w_i^2, ..., w_i^j, ..., w_i^m}, where w_i^j denotes the j-th word sequence in the i-th sentence sequence and w_i^j = {c_1, c_2, ..., c_k, ..., c_K}, c_k denoting the k-th character of the j-th word sequence w_i^j of the i-th sentence sequence S_i; n denotes the total number of sentences in the training data set, m the total number of words in a sentence, and K the total number of characters in a word;
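As an illustrative sketch only (the patent does not specify its segmentation tooling), the word and sentence segmentation of step 1 can be approximated with regular expressions; real biomedical pipelines would typically use a dedicated tokenizer:

```python
import re

def preprocess(text):
    """Split raw biomedical text into sentence sequences of word sequences.

    A minimal regex-based sketch: sentences are split on terminal punctuation
    followed by whitespace; words keep internal hyphens (common in gene and
    protein names) and punctuation becomes its own token.
    """
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return [re.findall(r"[A-Za-z0-9-]+|[^\sA-Za-z0-9]", s)
            for s in sentences if s]

S = preprocess("IL-2 activates NF-kappaB. TNF induces apoptosis!")
# S is the unlabeled training set: each S[i] is a word sequence,
# and each word is itself a character sequence.
```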
step 2, labeling the training data set S:
step 2.1, let the categories of trigger words and of named entities be L^td = {L_td^1, ..., L_td^n, ...} and L^ner = {L_ner^1, ..., L_ner^n, ...} respectively, where L_td^n denotes the n-th trigger word category and L_ner^n denotes the n-th entity category;
step 2.2, add both a trigger-word-category label and a named-entity-category label to every word sequence of the i-th sentence sequence S_i in the training data set S, thereby obtaining a labeled training data set S_td = {(w_i^j, y_td^{i,j})} for the trigger word detection task and a labeled training data set S_ner = {(w_i^j, y_ner^{i,j})} for the named entity recognition task, where (w_i^j, y_td^{i,j}) pairs the j-th word sequence w_i^j of the i-th sentence sequence S_i with its corresponding trigger word category, and (w_i^j, y_ner^{i,j}) pairs the j-th word sequence w_i^j of S_i with its corresponding entity category;
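The dual labeling of step 2.2 can be sketched as two parallel BIO tag sequences over the same sentence; all category names below (Protein, Gene_expression, Positive_regulation) are illustrative assumptions, not taken from the patent:

```python
# Hypothetical example: one sentence labeled for both tasks in parallel.
sentence = ["IL-2", "expression", "activates", "NF-kappaB"]

# Trigger word labels (BIO scheme over illustrative trigger categories).
y_td  = ["O", "B-Gene_expression", "B-Positive_regulation", "O"]
# Named entity labels (BIO scheme over illustrative entity categories).
y_ner = ["B-Protein", "O", "O", "B-Protein"]

# The two labeled data sets pair each word sequence with one label per task.
D_td  = list(zip(sentence, y_td))
D_ner = list(zip(sentence, y_ner))
```

The same word thus carries two labels at once, which is what lets one shared encoder serve both classification layers.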
Step 3, word vector pre-training:
obtaining biomedical documents and performing word segmentation and sentence segmentation to obtain an unlabeled training data set consisting of n' sentence sequences, denoted S' = {S'_1, S'_2, ..., S'_{i'}, ..., S'_{n'}}, where S'_{i'} denotes the i'-th sentence sequence; the unlabeled training data set S' is then trained with the language-model-based Word2Vec tool to obtain a pre-trained word vector matrix M;
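A sketch of how the pre-trained matrix M is consumed downstream; here M is a random stand-in (actually producing it would use a Word2Vec implementation such as gensim), and the vocabulary and dimension V = 50 are assumed for illustration:

```python
import numpy as np

V = 50                                # word vector dimension (arbitrary)
vocab = {"<unk>": 0, "IL-2": 1, "activates": 2, "NF-kappaB": 3}
rng = np.random.default_rng(0)
M = rng.normal(size=(len(vocab), V))  # stand-in for the pre-trained matrix

def lookup(words):
    """Map a word sequence to its rows of M (word-level word vectors)."""
    idx = [vocab.get(w, vocab["<unk>"]) for w in words]
    return M[idx]

X = lookup(["IL-2", "activates", "NF-kappaB"])  # shape (3, V)
```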
step 4, data processing of the MTL-TD-NER model, a multi-task-learning neural network for biomedical trigger word detection and named entity recognition; the MTL-TD-NER model consists of a word vector encoding layer based on hybrid embedding, a feature extraction layer based on a bidirectional LSTM, and classification layers based on conditional random fields;
step 4.1, use the pre-trained word vector matrix M to convert the i-th sentence sequence S_i in text form into word-level word vectors of an arbitrary dimension V, denoted X_i = {x_i^1, x_i^2, ..., x_i^j, ..., x_i^m}, where x_i^j denotes the word-level word vector of the j-th word w_i^j;
step 4.2, process the data with the word vector encoding layer based on hybrid embedding, which is composed of bidirectional long short-term memory (LSTM) network units:
step 4.2.1, input each character of the j-th word sequence w_i^j into a character-level bidirectional LSTM unit, which is trained on the probability of the characters generating the corresponding word;
step 4.2.2, extract the hidden-layer outputs of the bidirectional LSTM unit for the first character and the last character of the j-th word sequence w_i^j and concatenate them as the character-level vector r_i^j of the j-th word sequence w_i^j;
Step 4.3, for the jth word
Figure BDA0003092803070000035
Word-level word vector of
Figure BDA0003092803070000036
And character level word vectors
Figure BDA0003092803070000037
Splicing to obtain the jth word sequence
Figure BDA0003092803070000038
Mixed coded word vector of
Figure BDA0003092803070000039
Thereby obtaining the ith sentence sequence SiWord vector of
Figure BDA00030928030700000310
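Steps 4.2 and 4.3 can be sketched with a toy character encoder: a plain tanh recurrence stands in for the character-level bidirectional LSTM (no gates), its two directional final states are concatenated, and the result is joined to the word-level vector; all dimensions and weights are illustrative:

```python
import numpy as np

V, C = 50, 16            # word vector dim and char hidden dim (arbitrary)
rng = np.random.default_rng(1)
W_c = rng.normal(scale=0.1, size=(C, C))
ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-"
E_c = {ch: rng.normal(scale=0.1, size=C) for ch in ALPHABET}

def char_vector(word):
    """Toy stand-in for step 4.2: a tanh recurrence run forward and backward
    over the characters; the final state of each direction is concatenated."""
    def run(chars):
        h = np.zeros(C)
        for ch in chars:
            h = np.tanh(W_c @ h + E_c.get(ch, np.zeros(C)))
        return h
    return np.concatenate([run(word), run(reversed(word))])  # shape (2C,)

def hybrid_vector(word, word_vec):
    """Step 4.3: concatenate word-level and character-level vectors."""
    return np.concatenate([word_vec, char_vector(word)])

e = hybrid_vector("IL-2", np.zeros(V))  # shape (V + 2C,)
```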
And 4.4, processing data of the feature extraction layer based on the bidirectional LSTM:
the ith sentence sequence SiWord vector of
Figure BDA00030928030700000311
Inputting the sentence into a layer of forward LSTM network structure, and then inputting the ith sentence sequence SiWord vector of
Figure BDA00030928030700000312
Inputting into a layer of reverse LSTM network structure, and finally, inputting the jth word vector
Figure BDA00030928030700000313
Implicit in two LSTM networksLayer state output
Figure BDA00030928030700000314
And
Figure BDA00030928030700000315
the combinations are spliced together as a word vector at the j' th position
Figure BDA00030928030700000316
Context feature information of
Figure BDA00030928030700000317
Thereby obtaining the ith sentence sequence SiCharacteristic sequence of
Figure BDA00030928030700000318
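The feature extraction of step 4.4 can be sketched the same way, with a plain tanh recurrence standing in for the LSTM gates; the point is the per-position concatenation of forward and backward hidden states:

```python
import numpy as np

D_in, D_h = 20, 8                    # input and hidden dims (arbitrary)
rng = np.random.default_rng(2)
W_x = rng.normal(scale=0.1, size=(D_h, D_in))
W_h = rng.normal(scale=0.1, size=(D_h, D_h))

def directional_pass(E):
    """One recurrent pass over the sentence; returns a hidden state per word."""
    h, states = np.zeros(D_h), []
    for e in E:
        h = np.tanh(W_x @ e + W_h @ h)
        states.append(h)
    return np.stack(states)

def bidirectional_features(E):
    """Step 4.4: concatenate forward and backward hidden states per position."""
    fwd = directional_pass(E)
    bwd = directional_pass(E[::-1])[::-1]   # run reversed, then re-align
    return np.concatenate([fwd, bwd], axis=1)

E_i = rng.normal(size=(5, D_in))            # 5 hybrid word vectors
H_i = bidirectional_features(E_i)           # shape (5, 2 * D_h)
```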
And 4.5, processing data of a classification layer based on the conditional random field:
step 4.5.1, constructing a classification layer of a trigger word detection task and a classification layer of a named entity recognition task, wherein the classification layer takes a conditional random field as a basic unit; building parallel information conversion layer transform for classification layer of trigger word detection tasktd(ii) a Building parallel information conversion layer transform for classification layer of named entity recognition taskner
Step 4.5.2, feature sequence
Figure BDA00030928030700000319
Inputting the input into a classification layer of a trigger word detection task to obtain an output OtdThen output OtdInput to information transformationtdObtain the characteristic information F of the trigger wordtdFinally, the feature information F is obtainedtdAnd the characteristic sequence HiAdding the entity integral characteristics to obtain an entity integral characteristic, inputting the entity integral characteristic into a classification layer of a named entity identification task to obtain a final entity identification result
Figure BDA00030928030700000320
Step 4.5.3, mixingCharacteristic sequence
Figure BDA00030928030700000321
Inputting the input into a classification layer of a named entity recognition task to obtain an output OnerThen output OnerInput to information transformationnerTo obtain the characteristic information F of the entitynerFinally, the feature information F is obtainednerAnd the characteristic sequence HiAdding the overall characteristics of the trigger words to obtain the overall characteristics of the trigger words, inputting the overall characteristics into a classification layer of a trigger word detection task, and obtaining a final trigger word recognition result
Figure BDA0003092803070000041
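The cross-task wiring of steps 4.5.2 and 4.5.3 can be sketched with plain linear maps standing in for the CRF classification layers and the transform layers (the patent's classifiers are conditional random fields; everything below is an illustrative assumption about shapes and data flow only):

```python
import numpy as np

m, D = 5, 16                       # sentence length, feature dimension
rng = np.random.default_rng(3)
H_i = rng.normal(size=(m, D))      # feature sequence from the shared BiLSTM

def linear(d_out, d_in):
    return rng.normal(scale=0.1, size=(d_out, d_in))

W_td, W_ner = linear(D, D), linear(D, D)   # stand-ins for the two classifiers
T_td, T_ner = linear(D, D), linear(D, D)   # transform_td / transform_ner

# Step 4.5.2: trigger path feeds the entity classifier.
O_td = H_i @ W_td.T                # classifier output on the trigger side
F_td = O_td @ T_td.T               # transform_td: trigger feature information
entity_input = F_td + H_i          # overall entity features for NER

# Step 4.5.3: entity path feeds the trigger classifier, symmetrically.
O_ner = H_i @ W_ner.T
F_ner = O_ner @ T_ner.T
trigger_input = F_ner + H_i        # overall trigger features for detection
```

Note the symmetry: each task's intermediate output is transformed and added back to the shared features consumed by the other task.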
Step 5, training an MTL-TD-NER model to obtain an optimal trigger word detection and named entity recognition model:
step 5.1, set the parameter variables of the model:
the batch size is B; the current iteration number is epoch_now; the maximum number of iterations is epoch_max; the number of consecutive iterations in which the loss of the model has not decreased is epoch_no; and the maximum number of such iterations tolerated by the early-stopping strategy is epoch_es;
Step 5.2, parameter initialization:
initializing each parameter of a word vector layer based on mixed coding embedding, a feature extraction layer based on bidirectional LSTM and a classification layer based on conditional random field by adopting a uniform distribution method;
step 5.3, starting from epoch_now, input batches of size B from the training data set S into the MTL-TD-NER model each time, and compute with Eq. (1) the loss between the output labels of the model and the correct labels in the training data set S, so as to update the parameters of the model:

loss = λ·loss_td + μ·loss_ner    (1)

In Eq. (1), loss_td and loss_ner are the loss functions of the trigger word detection and named entity recognition tasks, with:

loss_td = −score(y_td) + log Σ_{ỹ_td ∈ Y_td} exp(score(ỹ_td))    (2)

loss_ner = −score(y_ner) + log Σ_{ỹ_ner ∈ Y_ner} exp(score(ỹ_ner))    (3)

In Eqs. (2) and (3), y_td and y_ner are the trigger word and named entity label sequences; score(y_td) and score(y_ner) are, respectively, the scores of the trigger word and named entity label sequences output by the MTL-TD-NER model for the input i-th sentence sequence S_i; λ and μ are hyperparameters used to balance the importance of the two tasks; Y_td denotes the set of all possible trigger word label sequences and Y_ner the set of all possible entity label sequences; ỹ_td denotes one trigger word label sequence in Y_td, and ỹ_ner denotes one entity label sequence in Y_ner;
step 5.4, if epoch_now is less than epoch_max and epoch_no is less than epoch_es, add 1 to epoch_now and then continue with step 5.3; if epoch_now is greater than or equal to epoch_max, or epoch_no equals epoch_es, the optimal multi-task-learning trigger word detection and named entity recognition network model is obtained;
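The control flow of steps 5.3 and 5.4 can be sketched as follows; train_one_epoch is a hypothetical stand-in that runs one pass over the batches and returns the epoch loss:

```python
def train(train_one_epoch, epoch_max=100, epoch_es=15):
    """Run training with the early-stopping rule of step 5.4: stop when
    epoch_max is reached or the loss has not decreased for epoch_es
    consecutive epochs."""
    best_loss = float("inf")
    epoch_now, epoch_no = 0, 0
    while epoch_now < epoch_max and epoch_no < epoch_es:
        loss = train_one_epoch(epoch_now)
        if loss < best_loss:
            best_loss, epoch_no = loss, 0   # improvement: reset the counter
        else:
            epoch_no += 1                   # no improvement this epoch
        epoch_now += 1
    return epoch_now, best_loss

# Simulated losses: improve for 10 epochs, then plateau.
losses = [1.0 / (e + 1) for e in range(10)] + [0.1] * 200
stop_epoch, best = train(lambda e: losses[e])
```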
and 6, identifying the unlabeled data by using the optimal trigger word detection and named entity identification network model so as to obtain a trigger word label and a named entity identification label of each word in the unlabeled data.
Compared with the prior art, the invention has the beneficial effects that:
1. the method is different from the traditional named entity identification method based on rules and machine learning, realizes an end-to-end neural network model, avoids the manual design of various rules such as lexical and syntactic rules and the manual extraction of linguistic features, and simplifies the implementation of trigger word detection and named entity identification.
2. The invention designs a neural network model to simultaneously process the trigger word detection task and the named entity recognition task, and adopts a hard parameter sharing mode to enable the two tasks to share the same word vector coding layer based on mixed coding embedding and the feature extraction layer based on bidirectional LSTM, thereby accelerating the training process of the model and improving the operation efficiency of the model.
3. The invention uses information conversion layers to exchange the mutually beneficial information between trigger words and named entities, so that useful feature information can be mined better; this information is fed into the respective classification layers so that the two tasks help each other to recognize trigger words and named entities better.
4. According to the method, the trigger detection task and the named entity recognition task are trained simultaneously under the multi-task learning framework, so that data enhancement can be performed implicitly, regularization is introduced, the risk of over-fitting is effectively avoided, and the recognition accuracy is improved.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
In this embodiment, a biomedical trigger word detection and named entity recognition method based on multi-task learning mainly uses a word vector encoding layer based on hybrid embedding and a feature extraction layer based on a bidirectional LSTM as the shared part of the two tasks, and then constructs two classification layers based on conditional random fields so as to perform trigger word detection and named entity recognition simultaneously. As shown in FIG. 1, the method proceeds according to the following steps:
step 1, preprocessing unstructured biomedical texts:
performing word segmentation and sentence segmentation on the unstructured biomedical text to obtain an unlabeled training data set consisting of n sentence sequences, denoted S = {S_1, S_2, ..., S_i, ..., S_n}; S_i denotes the i-th sentence sequence, and S_i = {w_i^1, w_i^2, ..., w_i^j, ..., w_i^m}, where w_i^j denotes the j-th word sequence in the i-th sentence sequence and w_i^j = {c_1, c_2, ..., c_k, ..., c_K}, c_k denoting the k-th character of the j-th word sequence w_i^j of the i-th sentence sequence S_i; n denotes the total number of sentences in the training data set, m the total number of words in a sentence, and K the total number of characters in a word;
step 2, labeling the training data set S:
step 2.1, let the categories of trigger words and of named entities be L^td = {L_td^1, ..., L_td^n, ...} and L^ner = {L_ner^1, ..., L_ner^n, ...} respectively, where L_td^n denotes the n-th trigger word category and L_ner^n denotes the n-th entity category;
step 2.2, add both a trigger-word-category label and a named-entity-category label to every word sequence of the i-th sentence sequence S_i in the training data set S, thereby obtaining a labeled training data set S_td = {(w_i^j, y_td^{i,j})} for the trigger word detection task and a labeled training data set S_ner = {(w_i^j, y_ner^{i,j})} for the named entity recognition task, where (w_i^j, y_td^{i,j}) pairs the j-th word sequence w_i^j of the i-th sentence sequence S_i with its corresponding trigger word category, and (w_i^j, y_ner^{i,j}) pairs the j-th word sequence w_i^j of S_i with its corresponding entity category;
Step 3, word vector pre-training:
in order to allow the word-level word vectors to contain a large amount of linguistic information, a large number of biomedical documents are downloaded from the PubMed database and subjected to word segmentation and sentence segmentation, giving an unlabeled training data set consisting of n' sentence sequences, denoted S' = {S'_1, S'_2, ..., S'_{i'}, ..., S'_{n'}}, where S'_{i'} denotes the i'-th sentence sequence; S' is then trained with the language-model-based Word2Vec tool to obtain a pre-trained word vector matrix M;
step 4, data processing of the MTL-TD-NER model, a multi-task-learning neural network for biomedical trigger word detection and named entity recognition; the MTL-TD-NER model consists of a word vector encoding layer based on hybrid embedding, a feature extraction layer based on a bidirectional LSTM, and classification layers based on conditional random fields;
step 4.1, use the pre-trained word vector matrix M to convert the i-th sentence sequence S_i in text form into word-level word vectors of an arbitrary dimension V, denoted X_i = {x_i^1, x_i^2, ..., x_i^j, ..., x_i^m}, where x_i^j denotes the word-level word vector of the j-th word w_i^j.
Step 4.2, processing data based on a word vector coding layer embedded by mixed coding, wherein the word vector coding layer is composed of bidirectional long-term and short-term memory network units;
step 4.2.1, in order to obtain the character level characteristic information of the word, the jth word sequence is processed
Figure BDA00030928030700000615
Each character in the character string is input into a bidirectional long and short term memory network unit at the character level and used for training the generation probability of generating corresponding words by all the characters;
step 4.2.2, extracting the jth word sequence
Figure BDA0003092803070000071
First word inThe character and the last character are used as the j word sequence after the output connection on the hidden layer in the bidirectional long and short term memory network unit
Figure BDA0003092803070000072
Character level vector of
Figure BDA0003092803070000073
Step 4.3, for the jth word
Figure BDA0003092803070000074
Word-level word vector of
Figure BDA0003092803070000075
And character level word vectors
Figure BDA0003092803070000076
Splicing to obtain the jth word sequence
Figure BDA0003092803070000077
Mixed coded word vector of
Figure BDA0003092803070000078
Thereby obtaining the ith sentence sequence SiWord vector of
Figure BDA0003092803070000079
And 4.4, processing data of the feature extraction layer based on the bidirectional LSTM:
in order to obtain the characteristic information of the whole context, the ith sentence sequence SiWord vector of
Figure BDA00030928030700000710
Inputting the sentence into a layer of forward LSTM network structure, and then inputting the ith sentence sequence SiWord vector of
Figure BDA00030928030700000711
Input to a layer of inverted LSTMIn the network structure, the jth word vector is finally added
Figure BDA00030928030700000712
Implicit layer state output in two LSTM networks
Figure BDA00030928030700000713
And
Figure BDA00030928030700000714
the combinations are spliced together as a word vector at the j' th position
Figure BDA00030928030700000715
Context feature information of
Figure BDA00030928030700000716
Thereby obtaining the ith sentence sequence SiCharacteristic sequence of
Figure BDA00030928030700000717
And 4.5, processing data of a classification layer based on the conditional random field:
4.5.1, constructing a classification layer of a trigger word detection task and a classification layer of a named entity recognition task, wherein the classification layer takes a conditional random field as a basic unit, so that the classification layer can well process the problem of label dependence; meanwhile, considering the relation that the trigger words and the entities can be correlated and promoted, parallel information conversion layer transform is constructed for the classification layer of the trigger word detection tasktd(ii) a Building parallel information conversion layer transform for classification layer of named entity recognition taskner
Step 4.5.2, feature sequence
Figure BDA00030928030700000718
Inputting the input into a classification layer of a trigger word detection task to obtain an output OtdThen output OtdInput to information transformationtdObtain the characteristic information F of the trigger wordtdFinally, the feature information F is obtainedtdAnd the characteristic sequence HiAdding the entity integral characteristics to obtain an entity integral characteristic, inputting the entity integral characteristic into a classification layer of a named entity identification task to obtain a final entity identification result
Figure BDA00030928030700000719
Step 4.5.3, feature sequence
Figure BDA00030928030700000720
Inputting the input into a classification layer of a named entity recognition task to obtain an output OnerThen output OnerInput to information transformationnerTo obtain the characteristic information F of the entitynerFinally, the feature information F is obtainednerAnd the characteristic sequence HiAdding the overall characteristics of the trigger words to obtain the overall characteristics of the trigger words, inputting the overall characteristics into a classification layer of a trigger word detection task, and obtaining a final trigger word recognition result
Figure BDA00030928030700000721
Step 5, training an MTL-TD-NER model to obtain an optimal trigger word detection and named entity recognition model:
step 5.1, set the parameter variables of the model:
the batch size B is set to 50; the current iteration number epoch_now starts at 0; the maximum number of iterations epoch_max is 100; the number of consecutive iterations in which the loss of the model has not decreased, epoch_no, starts at 0; and the maximum number of such iterations tolerated by the early-stopping strategy, epoch_es, is 15;
step 5.2, parameter initialization:
initializing each parameter of a word vector layer based on mixed coding embedding, a feature extraction layer based on bidirectional LSTM and a classification layer based on conditional random field by adopting a uniform distribution method;
step 5.3, starting from epoch_now, input batches of size B from the training data set S into the MTL-TD-NER model each time, and compute with Eq. (1) the loss between the output labels of the model and the correct labels in the training data set S, so as to update the parameters of the model:

loss = λ·loss_td + μ·loss_ner    (1)

In Eq. (1), loss_td and loss_ner are the loss functions of the trigger word detection and named entity recognition tasks, with:

loss_td = −score(y_td) + log Σ_{ỹ_td ∈ Y_td} exp(score(ỹ_td))    (2)

loss_ner = −score(y_ner) + log Σ_{ỹ_ner ∈ Y_ner} exp(score(ỹ_ner))    (3)

In Eqs. (2) and (3), y_td and y_ner are the trigger word and named entity label sequences; score(y_td) and score(y_ner) are, respectively, the scores of the trigger word and named entity label sequences output by the MTL-TD-NER model for the input i-th sentence sequence S_i; λ and μ are hyperparameters used to balance the importance of the two tasks, and both are set to 1 here; Y_td denotes the set of all possible trigger word label sequences and Y_ner the set of all possible entity label sequences; ỹ_td denotes one trigger word label sequence in Y_td, and ỹ_ner denotes one entity label sequence in Y_ner;
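Assuming Eqs. (2) and (3) take the standard CRF negative-log-likelihood form (minus the gold-sequence score plus a log-sum-exp normalizer over candidate label sequences), the joint loss of Eq. (1) can be checked numerically; here the exponentially many candidate sequences are replaced by a small explicit list of toy scores:

```python
import math

def crf_nll(gold_score, all_scores):
    """-score(y) + log Σ exp(score(ỹ)) over candidate label sequences,
    computed with the log-sum-exp trick for numerical stability."""
    m = max(all_scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in all_scores))
    return -gold_score + log_z

# Toy scores for a handful of candidate tag sequences per task;
# the gold sequence's score is included in each candidate list.
loss_td  = crf_nll(gold_score=3.0, all_scores=[3.0, 1.0, 0.5])
loss_ner = crf_nll(gold_score=2.0, all_scores=[2.0, 2.0, 0.0])

lam = mu = 1.0                         # both hyperparameters set to 1
loss = lam * loss_td + mu * loss_ner   # Eq. (1)
```

Because the gold score is one term inside the normalizer, each task loss is strictly positive and shrinks as the model assigns the gold sequence a larger share of the total score.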
step 5.4, if epoch_now is less than epoch_max and epoch_no is less than epoch_es, add 1 to epoch_now and then continue with step 5.3; if epoch_now is greater than or equal to epoch_max, or epoch_no equals epoch_es, the optimal multi-task-learning trigger word detection and named entity recognition network model is obtained;
and 6, identifying the unlabeled data by using the optimal trigger word detection and named entity identification network model so as to obtain a trigger word label and a named entity identification label of each word in the unlabeled data.
The above describes a biomedical trigger word detection and named entity recognition method based on multi-task learning. The method avoids using two different models to perform trigger word detection and named entity recognition separately, and instead designs a multi-task learning framework that performs the two tasks simultaneously. Experiments with the MTL-TD-NER model on a data set verify the effectiveness of the proposed multi-task learning framework and show that it has certain advantages in both trigger word detection and named entity recognition.

Claims (1)

1. A biomedical trigger word detection and named entity recognition method based on multi-task learning, characterized by comprising the following steps:
step 1, preprocessing the unstructured biomedical text:
performing word segmentation and sentence segmentation on the unstructured biomedical text to obtain an unlabeled training data set consisting of n sentence sequences, denoted S = {S_1, S_2, ..., S_i, ..., S_n}, where S_i denotes the i-th sentence sequence, S_i = {w_i^1, w_i^2, ..., w_i^j, ..., w_i^m}; w_i^j denotes the j-th word sequence in the i-th sentence sequence, w_i^j = {c_j^1, c_j^2, ..., c_j^k, ..., c_j^K}; c_j^k denotes the k-th character of the j-th word sequence w_i^j of the i-th sentence sequence S_i; n denotes the total number of sentences in the training data set, m denotes the total number of words in a sentence, and K denotes the total number of characters in a word;
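A minimal sketch of the step-1 preprocessing, using a simple regex tokenizer as a stand-in for a full biomedical tokenizer (the splitting rules below are illustrative):

```python
import re

def preprocess(text):
    """Split raw biomedical text into sentences, words, and characters,
    yielding the nested structure of step 1: a list of sentences, each a
    list of words, each a list of characters."""
    # Sentence segmentation: split after sentence-final punctuation.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    S = []
    for sent in sentences:
        # Word segmentation: alphanumeric runs or single punctuation marks.
        words = re.findall(r"\w+|[^\w\s]", sent)
        S.append([list(w) for w in words])   # each word as its characters
    return S

S = preprocess("IL-2 activates T cells. STAT3 binds DNA.")
```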
step 2, labeling the training data set S:
step 2.1, setting the trigger word categories and the named entity categories as L_td = {L_td^1, L_td^2, ...} and L_ner = {L_ner^1, L_ner^2, ...}, where L_td^n denotes the n-th trigger word category and L_ner^n denotes the n-th entity category;
step 2.2, simultaneously adding trigger-word-category labels and named-entity-category labels to all word sequences of the i-th sentence sequence S_i in the training data set S, thereby obtaining a labeled training data set D_td = {(w_i^j, y_td^j)} for the trigger word detection task and a labeled training data set D_ner = {(w_i^j, y_ner^j)} for the named entity recognition task, where (w_i^j, y_td^j) denotes the j-th word sequence w_i^j of the i-th sentence sequence S_i together with its corresponding trigger word category y_td^j, and (w_i^j, y_ner^j) denotes the j-th word sequence w_i^j together with its corresponding entity category y_ner^j;
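The dual labeling of step 2.2 can be illustrated with one toy sentence. The BIO-style tag names below are hypothetical examples, not tag sets defined by the patent, which only requires one trigger label and one entity label per word:

```python
# One sentence labeled twice: once with trigger-word tags (for D_td)
# and once with named-entity tags (for D_ner).
words = ["IL-2", "expression", "activates", "T", "cells"]
trigger_tags = ["O", "B-Gene_expression", "B-Positive_regulation", "O", "O"]
entity_tags  = ["B-Protein", "O", "O", "B-Cell", "I-Cell"]

D_td  = list(zip(words, trigger_tags))   # trigger-detection training pairs
D_ner = list(zip(words, entity_tags))    # named-entity training pairs
```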
step 3, word vector pre-training:
obtaining biomedical documents and performing word segmentation and sentence segmentation to obtain an unlabeled corpus consisting of n′ sentence sequences, denoted S′ = {S′_1, S′_2, ..., S′_i′, ..., S′_n′}, where S′_i′ denotes the i′-th sentence sequence; training on the unlabeled corpus S′ with the language-model-based Word2Vec tool to obtain a pre-trained word vector matrix M;
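Assuming the pre-trained matrix M maps vocabulary indices to rows of dimension V, the lookup from word sequences to word vectors might look like this (the vocabulary, dimensions, and `<UNK>` fallback are illustrative; a random matrix stands in for the Word2Vec output):

```python
import numpy as np

rng = np.random.default_rng(0)
V = 50                                   # word-vector dimensionality
vocab = ["<UNK>", "IL", "2", "activates", "T", "cells"]
word2id = {w: i for i, w in enumerate(vocab)}

# Stand-in for the Word2Vec-trained matrix M: one row per vocabulary word.
M = rng.standard_normal((len(vocab), V))

def lookup(words):
    """Map a word sequence to rows of M, falling back to <UNK>."""
    ids = [word2id.get(w, word2id["<UNK>"]) for w in words]
    return M[ids]

X = lookup(["IL", "2", "binds"])         # "binds" is out of vocabulary
```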
step 4, data processing of the MTL-TD-NER model, a neural network for biomedical trigger word detection and named entity recognition based on multi-task learning; the MTL-TD-NER model consists of a word vector encoding layer based on hybrid-coding embedding, a feature extraction layer based on bidirectional LSTM, and classification layers based on conditional random fields;
step 4.1, using the pre-trained word vector matrix M, converting the i-th sentence sequence S_i from text form into word-level word vectors of dimension V, X_i = {x_1^w, x_2^w, ..., x_j^w, ..., x_m^w}, where x_j^w denotes the word-level word vector of the j-th word w_i^j;
step 4.2, data processing of the word vector encoding layer based on hybrid-coding embedding, the encoding layer consisting of character-level bidirectional long short-term memory (LSTM) network units:
step 4.2.1, inputting each character of the j-th word sequence w_i^j into a character-level bidirectional LSTM unit, which is trained on the probability of the characters generating the corresponding word;
step 4.2.2, taking the hidden-layer outputs of the bidirectional LSTM unit at the last character (forward direction) and the first character (backward direction) and concatenating them as the character-level vector x_j^c of the j-th word sequence w_i^j;
step 4.3, concatenating the word-level word vector x_j^w and the character-level vector x_j^c of the j-th word w_i^j to obtain the hybrid-coded word vector x_j = [x_j^w; x_j^c] of the j-th word sequence w_i^j, thereby obtaining the word vector sequence X_i = {x_1, x_2, ..., x_m} of the i-th sentence sequence S_i;
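The hybrid encoding of steps 4.2 and 4.3 can be sketched at the shape level. A toy recurrence stands in for the character-level bidirectional LSTM, and all sizes are illustrative; a real implementation would use LSTM cells:

```python
import numpy as np

rng = np.random.default_rng(1)
Hc, V = 8, 12                  # char hidden size, word-vector size (illustrative)
char_emb = {c: rng.standard_normal(Hc) for c in "abcdefghijklmnopqrstuvwxyz"}

def toy_rnn(embs):
    """Toy recurrence standing in for one LSTM direction."""
    h = np.zeros(Hc)
    for e in embs:
        h = np.tanh(0.5 * h + e)
    return h

def char_level_vector(word):
    """Step 4.2 sketch: run both directions over the characters and
    concatenate the final forward state (at the last character) with the
    final backward state (at the first character)."""
    embs = [char_emb[c] for c in word.lower() if c in char_emb]
    fwd = toy_rnn(embs)            # forward direction
    bwd = toy_rnn(embs[::-1])      # backward direction
    return np.concatenate([fwd, bwd])   # 2*Hc char-level vector

# Step 4.3 sketch: splice the word-level vector with the char-level vector.
word_vec = rng.standard_normal(V)      # word-level vector looked up from M
hybrid = np.concatenate([word_vec, char_level_vector("cells")])
```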
step 4.4, data processing of the feature extraction layer based on bidirectional LSTM:
inputting the word vector sequence X_i of the i-th sentence sequence S_i into a forward LSTM layer and into a backward LSTM layer; for the j-th word vector x_j, the hidden-layer states output by the two LSTM directions, h_j→ and h_j←, are concatenated as the context feature h_j = [h_j→; h_j←] of the word at the j-th position, thereby obtaining the feature sequence H_i = {h_1, h_2, ..., h_m} of the i-th sentence sequence S_i;
step 4.5, data processing of the classification layers based on conditional random fields:
step 4.5.1, constructing a classification layer for the trigger word detection task and a classification layer for the named entity recognition task, each taking a conditional random field as its basic unit; building a parallel information conversion layer transform_td for the classification layer of the trigger word detection task and a parallel information conversion layer transform_ner for the classification layer of the named entity recognition task;
step 4.5.2, inputting the feature sequence H_i into the classification layer of the trigger word detection task to obtain an output O_td; inputting O_td into transform_td to obtain trigger word feature information F_td; finally, adding the feature information F_td to the feature sequence H_i to obtain entity-oriented overall features, and inputting them into the classification layer of the named entity recognition task to obtain the final entity recognition result Y_ner;
step 4.5.3, inputting the feature sequence H_i into the classification layer of the named entity recognition task to obtain an output O_ner; inputting O_ner into transform_ner to obtain entity feature information F_ner; finally, adding the feature information F_ner to the feature sequence H_i to obtain trigger-word-oriented overall features, and inputting them into the classification layer of the trigger word detection task to obtain the final trigger word recognition result Y_td;
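The cross-task feature sharing of steps 4.5.2 and 4.5.3 reduces to the following shape-level sketch, with plain linear score layers standing in for the CRF classification layers (all matrices are random stand-ins, and the dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
m, d = 5, 16                   # sentence length, BiLSTM feature size
n_td, n_ner = 4, 6             # number of trigger tags / entity tags

H = rng.standard_normal((m, d))            # feature sequence H_i for one sentence

W_td  = rng.standard_normal((d, n_td))     # trigger classification layer (stand-in)
W_ner = rng.standard_normal((d, n_ner))    # entity classification layer (stand-in)
T_td  = rng.standard_normal((n_td, d))     # transform_td: tag space -> feature space
T_ner = rng.standard_normal((n_ner, d))    # transform_ner: tag space -> feature space

# Step 4.5.2: trigger output feeds the entity classifier.
O_td = H @ W_td                            # trigger-task output
F_td = O_td @ T_td                         # trigger feature information
Y_ner_scores = (H + F_td) @ W_ner          # entity scores enriched with trigger cues

# Step 4.5.3: entity output feeds the trigger classifier.
O_ner = H @ W_ner
F_ner = O_ner @ T_ner
Y_td_scores = (H + F_ner) @ W_td           # trigger scores enriched with entity cues
```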
step 5, training the MTL-TD-NER model to obtain the optimal trigger word detection and named entity recognition model:
step 5.1, setting the parameter variables of the model: the batch size is B, the current iteration number is epoch_now, the maximum number of iterations is epoch_max, the number of consecutive iterations in which the loss value of the model has not decreased is epoch_no, and the maximum such number for the early-stopping strategy is epoch_es;
step 5.2, parameter initialization:
initializing the parameters of the hybrid-coding-embedding word vector layer, the bidirectional-LSTM feature extraction layer, and the conditional-random-field classification layers by sampling from a uniform distribution;
step 5.3, starting from epoch_now, inputting batches of size B from the training data set S into the MTL-TD-NER model each time, and computing with formula (1) the loss between the labels output by the model and the correct labels in the training data set S, so as to update the parameters of the model:
loss = λ·loss_td + μ·loss_ner    (1)
in formula (1), loss_td and loss_ner are the loss functions of the trigger word detection task and the named entity recognition task:
loss_td = -score(y_td) + log Σ_{ỹ_td ∈ Y_td} exp(score(ỹ_td))    (2)
loss_ner = -score(y_ner) + log Σ_{ỹ_ner ∈ Y_ner} exp(score(ỹ_ner))    (3)
in formulas (2) and (3), y_td and y_ner are the correct trigger word tag sequence and named entity tag sequence; score(y_td) and score(y_ner) are respectively the scores of the trigger word tag sequence and the named entity tag sequence output by the MTL-TD-NER model for the i-th sentence sequence S_i; λ and μ are hyper-parameters that balance the importance of the two tasks; Y_td denotes the set of all possible trigger word tag sequences, Y_ner denotes the set of all possible entity tag sequences, ỹ_td denotes one trigger word tag sequence in Y_td, and ỹ_ner denotes one entity tag sequence in Y_ner;
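Formulas (2) and (3) have the form of the standard negative log-likelihood of a linear-chain CRF. A brute-force version, practical only for tiny tag sets (real CRF layers use the forward algorithm instead of enumeration), might look like this; all tensors and the λ, μ values are illustrative:

```python
import itertools
import numpy as np

def crf_nll(emissions, transitions, gold):
    """Negative log-likelihood of a linear-chain CRF, as in Eqs. (2)/(3):
    loss = -score(y) + log Σ_{ỹ} exp(score(ỹ)), enumerating every tag
    sequence ỹ by brute force for clarity."""
    m, n_tags = emissions.shape

    def score(seq):
        s = sum(emissions[t, tag] for t, tag in enumerate(seq))
        s += sum(transitions[a, b] for a, b in zip(seq, seq[1:]))
        return s

    log_z = np.log(sum(np.exp(score(seq))
                       for seq in itertools.product(range(n_tags), repeat=m)))
    return log_z - score(gold)

rng = np.random.default_rng(3)
loss_td = crf_nll(rng.standard_normal((4, 3)),   # 4 words, 3 trigger tags
                  rng.standard_normal((3, 3)),
                  gold=(0, 2, 1, 0))
loss_ner = crf_nll(rng.standard_normal((4, 5)),  # 4 words, 5 entity tags
                   rng.standard_normal((5, 5)),
                   gold=(1, 0, 3, 2))

lam, mu = 0.6, 0.4                 # illustrative values for the two weights
loss = lam * loss_td + mu * loss_ner   # combined multi-task loss, Eq. (1)
```

Because log Σ exp(score(ỹ)) always exceeds the score of any single sequence, each task loss is strictly positive.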
step 5.4, if epoch_now < epoch_max and epoch_no < epoch_es, incrementing epoch_now by 1 and returning to step 5.3; if epoch_now ≥ epoch_max or epoch_no = epoch_es, obtaining the optimal multi-task-learning-based trigger word detection and named entity recognition network model;
step 6, recognizing unlabeled data with the optimal trigger word detection and named entity recognition network model, so as to obtain the trigger word tag and the named entity tag of each word in the unlabeled data.
CN202110617440.1A 2021-05-31 2021-05-31 Biomedical trigger word detection and named entity identification method based on multi-task learning Active CN113360667B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110617440.1A CN113360667B (en) 2021-05-31 2021-05-31 Biomedical trigger word detection and named entity identification method based on multi-task learning


Publications (2)

Publication Number Publication Date
CN113360667A true CN113360667A (en) 2021-09-07
CN113360667B CN113360667B (en) 2022-07-26

Family

ID=77531522

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110617440.1A Active CN113360667B (en) 2021-05-31 2021-05-31 Biomedical trigger word detection and named entity identification method based on multi-task learning

Country Status (1)

Country Link
CN (1) CN113360667B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113553853A (en) * 2021-09-16 2021-10-26 南方电网数字电网研究院有限公司 Named entity recognition method and device, computer equipment and storage medium
CN114580422A (en) * 2022-03-14 2022-06-03 昆明理工大学 Named entity identification method combining two-stage classification of neighbor analysis

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104965819A (en) * 2015-07-12 2015-10-07 大连理工大学 Biomedical event trigger word identification method based on syntactic word vector
CN105512209A (en) * 2015-11-28 2016-04-20 大连理工大学 Biomedicine event trigger word identification method based on characteristic automatic learning
CN108628970A (en) * 2018-04-17 2018-10-09 大连理工大学 A kind of biomedical event joint abstracting method based on new marking mode
CN111222318A (en) * 2019-11-19 2020-06-02 陈一飞 Trigger word recognition method based on two-channel bidirectional LSTM-CRF network
WO2020193966A1 (en) * 2019-03-26 2020-10-01 Benevolentai Technology Limited Name entity recognition with deep learning


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
YAN WANG et al.: "Biomedical event trigger detection based on bidirectional LSTM and CRF", 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) *
YANSEN SU et al.: "EMODMI: A Multi-Objective Optimization Based Method to Identify Disease Modules", IEEE Transactions on Emerging Topics in Computational Intelligence *
HE Xinyu: "Research on Key Issues of Biomedical Event Extraction Based on Text Mining", China Doctoral Dissertations Full-text Database *
SU Yansen et al.: "Research on multi-objective path planning of underwater crawling robots", Journal of Hefei University of Technology (Natural Science) *


Also Published As

Publication number Publication date
CN113360667B (en) 2022-07-26

Similar Documents

Publication Publication Date Title
CN111444726B (en) Chinese semantic information extraction method and device based on long-short-term memory network of bidirectional lattice structure
CN110502749B (en) Text relation extraction method based on double-layer attention mechanism and bidirectional GRU
CN108984526B (en) Document theme vector extraction method based on deep learning
Yao et al. An improved LSTM structure for natural language processing
CN112115238B (en) Question-answering method and system based on BERT and knowledge base
CN112541356B (en) Method and system for recognizing biomedical named entities
CN108388560A (en) GRU-CRF meeting title recognition methods based on language model
CN114943230B (en) Method for linking entities in Chinese specific field by fusing common sense knowledge
CN111950283B (en) Chinese word segmentation and named entity recognition system for large-scale medical text mining
CN111738007A (en) Chinese named entity identification data enhancement algorithm based on sequence generation countermeasure network
CN113360667B (en) Biomedical trigger word detection and named entity identification method based on multi-task learning
Gao et al. Named entity recognition method of Chinese EMR based on BERT-BiLSTM-CRF
CN112784604A (en) Entity linking method based on entity boundary network
CN111914556A (en) Emotion guiding method and system based on emotion semantic transfer map
CN114676255A (en) Text processing method, device, equipment, storage medium and computer program product
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
Xu et al. Sentence segmentation for classical Chinese based on LSTM with radical embedding
CN113094502A (en) Multi-granularity takeaway user comment sentiment analysis method
CN112800184A (en) Short text comment emotion analysis method based on Target-Aspect-Opinion joint extraction
CN115545021A (en) Clinical term identification method and device based on deep learning
CN112101014A (en) Chinese chemical industry document word segmentation method based on mixed feature fusion
CN115169349A (en) Chinese electronic resume named entity recognition method based on ALBERT
Liu et al. Improved Chinese sentence semantic similarity calculation method based on multi-feature fusion
CN114564953A (en) Emotion target extraction model based on multiple word embedding fusion and attention mechanism
CN116522165B (en) Public opinion text matching system and method based on twin structure

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant