CN114519355A - Medicine named entity recognition and entity standardization method - Google Patents


Info

Publication number
CN114519355A
CN114519355A (application number CN202210017353.7A)
Authority
CN
China
Prior art keywords
word
subset
training
matrix
data
Prior art date
Legal status
Pending
Application number
CN202210017353.7A
Other languages
Chinese (zh)
Inventor
金冉
余同瑞
侯腾达
Current Assignee
Zhejiang Wanli University
Original Assignee
Zhejiang Wanli University
Priority date
Filing date
Publication date
Application filed by Zhejiang Wanli University filed Critical Zhejiang Wanli University
Publication of CN114519355A publication Critical patent/CN114519355A/en
Pending legal-status Critical Current

Classifications

    • G06F40/295 Named entity recognition (G Physics; G06F Electric digital data processing; G06F40/00 Handling natural language data; G06F40/20 Natural language analysis; G06F40/279 Recognition of textual entities; G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking)
    • G06N3/044 Recurrent networks, e.g. Hopfield networks (G06N Computing arrangements based on specific computational models; G06N3/00 Computing arrangements based on biological models; G06N3/02 Neural networks; G06N3/04 Architecture, e.g. interconnection topology)
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a medicine named entity recognition and entity standardization method. Character features are extracted from the input data with a CNN, context-dependent word vectors are obtained with ELMo, and the embedding matrices of biomedically pre-trained words are input into a BLSTM-CNN-CRF based deep learning model; the output labels are fed back to update the task parameters, so that the drug named entity recognition task (DNER) and the entity standardization task (DNEN) support each other. The invention adopts a fully shared mode in which the BiLSTM-CNN layers are shared between tasks, meaning that, apart from the output layers set up for DNER and DNEN respectively, all parameters of the BLSTM-CNN-CRF based deep learning model are shared; this structure ensures that the model can capture the feature representations of the different tasks and feed them back to each other to generate the prediction sequence. The method thus achieves mutual support between the two tasks of drug named entity recognition and entity standardization, and the accuracy of recognizing entity names and entity boundaries is higher.

Description

Medicine named entity recognition and entity standardization method
Technical Field
The invention relates to a medicine named entity recognition and entity standardization technology, in particular to a medicine named entity recognition and entity standardization method.
Background
With the rapid development of biomedicine, the exponential growth of the related literature makes it difficult to extract the large amount of drug information manually, yet the entity information hidden in that information is important for biomedical research and applications; to make full use of drug texts, the entity information they contain must be captured accurately.
With the rapid updating and development of medical knowledge, manually compiled dictionaries can hardly meet practical needs, and the unstructured and specialized nature of medical texts makes manual labelling too costly. To solve this problem, the current mainstream approach is to build, by machine learning, a model capable of drug named entity recognition and normalization. However, most machine-learning-based model frameworks are limited by the complexity of medical language and by insufficient extraction of text information: they model Drug Named Entity Recognition (DNER) and Drug Named Entity Normalization (DNEN) separately and simply, and cannot achieve mutual support between the two tasks, so the accuracy of recognizing entity names and entity boundaries is not high.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a medicine named entity identification and entity standardization method with higher accuracy for identifying entity names and entity boundaries.
The technical scheme adopted by the invention for solving the technical problems is as follows: a medicine named entity recognition and entity normalization method comprises the following steps:
step 1, acquiring data from the DDI2011 and DDI2013 challenge corpora to construct a data set for training the deep learning model, and performing preliminary processing on this data set as follows: the data set is randomly divided into T equal subsets D_1, D_2, …, D_T, where T is an integer of 2 or more; four alphabets, namely word, character (char), label and feature, are established for each subset, and each alphabet stores {key: instance, value: index} pairs, where key represents a stored key, value represents a stored value, instance refers to a word and index refers to an index; the label is the symbol used to mark a word with the BIOES tagging scheme, in which B indicates that the word is at the beginning (Begin) of an entity, I indicates Inside, O indicates Outside, E indicates that the word is at the end (End) position of an entity, and S indicates that the word forms an entity by itself (Single);
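The following is a minimal, illustrative sketch of the alphabet construction and BIOES labelling described in step 1; it is not the patent's actual code, and all names and the example sentence are hypothetical.

```python
# Build an alphabet as an {instance: index} dictionary, plus a helper that
# produces BIOES labels for one entity mention of a given length.

def build_alphabet(instances):
    """Map each distinct instance (word, char, label or feature) to an index."""
    alphabet = {}
    for instance in instances:
        if instance not in alphabet:
            alphabet[instance] = len(alphabet)
    return alphabet

def bioes_tags(num_tokens, entity_type="DRUG"):
    """BIOES labels for one entity mention spanning num_tokens tokens."""
    if num_tokens == 1:
        return [f"S-{entity_type}"]
    return ([f"B-{entity_type}"]
            + [f"I-{entity_type}"] * (num_tokens - 2)
            + [f"E-{entity_type}"])

sentence = ["Aspirin", "inhibits", "platelet", "aggregation"]
word_alphabet = build_alphabet(sentence)
char_alphabet = build_alphabet(ch for w in sentence for ch in w)
label_alphabet = build_alphabet(["O"] + bioes_tags(1))
print(word_alphabet)   # {'Aspirin': 0, 'inhibits': 1, ...}
print(bioes_tags(3))   # ['B-DRUG', 'I-DRUG', 'E-DRUG']
```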
step 2, establishing two lists for each subset based on its four alphabets, each list containing four columns of data: the four columns of the first list are, in order, [words, chars, labels, features], and the four columns of the second list are, in order, [words_Ids, char_Ids, label_Ids, features_Ids]; in the first list of any subset, words corresponds to the data in the word alphabet of the subset, chars corresponds to the data in the char alphabet of the subset, labels corresponds to the data in the label alphabet of the subset, and features corresponds to the data in the feature alphabet of the subset; in the second list of any subset, words_Ids records the position, in the corresponding word alphabet, of each item in the words column of the first list of the subset, char_Ids records the position, in the corresponding char alphabet, of each item in the chars column, label_Ids records the position, in the corresponding label alphabet, of each item in the labels column, and features_Ids records the position, in the corresponding feature alphabet, of each item in the features column;
step 3, performing data processing on each of the T subsets of step 1 to obtain T training subsets, as follows: all words in the four alphabets (word, char, label and feature) of a subset are traversed and an embedding matrix is established for each word in each alphabet; sentences in the PMC (PubMed Central) and PubMed biomedical corpora are pre-trained with the GloVe algorithm to obtain word vectors, and these word vectors are used to assign and update the embedding matrix of each word in the word alphabet, while the embedding matrices of the words in the char, label and feature alphabets are assigned and updated randomly; the updated embedding matrices of the words in the four alphabets of the subset are then used to update the corresponding data in the subset in reverse, giving the training subset corresponding to that subset;
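A possible sketch of the embedding initialization of step 3 is shown below. The GloVe file format, the file name and the dimensions are assumptions; only the general scheme (pre-trained vectors assigned to word embeddings, random initialization elsewhere) follows the text.

```python
import numpy as np

def load_glove(path):
    """Read a GloVe-style text file into {word: vector}; the path is hypothetical."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def build_embedding_matrix(alphabet, dim, pretrained=None, seed=0):
    rng = np.random.default_rng(seed)
    # random initialization for every entry ...
    matrix = rng.uniform(-0.1, 0.1, size=(len(alphabet), dim)).astype(np.float32)
    # ... overwritten by pre-trained vectors when available (assignment update)
    if pretrained is not None:
        for token, idx in alphabet.items():
            if token in pretrained:
                matrix[idx] = pretrained[token][:dim]
    return matrix

# glove = load_glove("glove_pmc_pubmed_100d.txt")                # hypothetical file
# word_emb = build_embedding_matrix(word_alphabet, 100, pretrained=glove)
# char_emb = build_embedding_matrix(char_alphabet, 30)           # random only
```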
step 4, extracting sentences in batches from each training subset, each batch drawing BatchSize sentences at random until all sentences in the training subset have been extracted (if fewer than 10 unextracted sentences remain in the training subset, all remaining sentences are taken as the last batch), with BatchSize equal to 10; the information corresponding to each batch of sentences is obtained from the first list of the subset corresponding to the training subset, and each batch is modified into a nested list of the form [words, chars, labels, features]; each batch of sentences obtained in this way is then processed as follows so that one batch of sentences becomes one group of training data: the sentences in a batch are arranged in descending order of sentence length, the length of the longest sentence in the batch is determined, and 0 is appended to every shorter sentence until its length equals that of the longest sentence in the batch; finally, the length of the longest word in each sentence of the batch is determined, and 0 is appended to every shorter word until its length equals that of the longest word in the sentence, giving a processed batch of sentences that serves as one group of training data; in this way each training subset yields several groups of training data;
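The sorting and zero-padding of step 4 can be sketched as follows; the function name and the toy index values are illustrative only.

```python
def pad_batch(batch_word_ids, batch_char_ids):
    """Sort a batch by sentence length (descending), pad sentences with 0 to the
    longest sentence and pad words with 0 to the longest word in the batch."""
    order = sorted(range(len(batch_word_ids)),
                   key=lambda i: len(batch_word_ids[i]), reverse=True)
    words = [batch_word_ids[i] for i in order]
    chars = [batch_char_ids[i] for i in order]
    max_sent = len(words[0])
    max_word = max(len(c) for sent in chars for c in sent)
    padded_words = [w + [0] * (max_sent - len(w)) for w in words]
    padded_chars = [[c + [0] * (max_word - len(c)) for c in sent]
                    + [[0] * max_word] * (max_sent - len(sent))
                    for sent in chars]
    return padded_words, padded_chars

w, c = pad_batch([[3, 5], [7, 8, 9]],
                 [[[1], [2, 4]], [[5], [6], [7, 8, 9]]])
# w == [[7, 8, 9], [3, 5, 0]]  -- longer sentence first, shorter one zero-padded
```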
step 5, constructing a deep learning model based on BLSTM-CNN-CRF, which comprises three layers in sequence: a character layer, a BiLSTM layer and a CRF layer; the character layer is provided with a convolutional neural network (CNN) and the Dropout algorithm and supports both pre-training and random initialization: it obtains the word-level word embedding matrices of all data input into it by pre-training or random initialization, obtains the character-level word embedding matrices and labels of all data through the convolutional neural network, and finally concatenates the word-level word embedding matrices, the character-level word embedding matrices and the labels, inputting the resulting embedding matrices of the concatenated words into the BiLSTM layer; the Dropout algorithm is used to prevent overfitting during the operation of the convolutional neural network; the BiLSTM layer trains on the embedding matrix of each word in the embedding matrices of the concatenated words input into it and predicts labels, captures the hidden contextual feature information of each of these word embedding matrices, obtains a hidden state sequence mapping matrix from this hidden feature information and inputs it into the CRF layer; the CRF layer obtains the corresponding medicine named entity recognition and entity normalization results from the hidden state sequence mapping matrix;
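A minimal PyTorch sketch of the three layers of step 5 is given below. The dimensions are illustrative rather than the patent's exact configuration, and the CRF layer is taken from the third-party pytorch-crf package, which is an assumption about tooling, not part of the patent.

```python
import torch
import torch.nn as nn
from torchcrf import CRF   # third-party "pytorch-crf" package (assumption)

class BLSTMCNNCRF(nn.Module):
    """Character CNN -> BiLSTM over concatenated word/char features -> CRF."""
    def __init__(self, n_words, n_chars, n_tags,
                 word_dim=100, char_dim=30, char_filters=30, hidden=200):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, word_dim, padding_idx=0)
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.char_cnn = nn.Conv1d(char_dim, char_filters, kernel_size=3, padding=1)
        self.dropout = nn.Dropout(0.5)                      # prevents overfitting
        self.bilstm = nn.LSTM(word_dim + char_filters, hidden,
                              batch_first=True, bidirectional=True)
        self.hidden2tag = nn.Linear(2 * hidden, n_tags)     # hidden states -> label scores
        self.crf = CRF(n_tags, batch_first=True)

    def _char_features(self, char_ids):
        b, s, w = char_ids.shape                             # batch, sentence, word length
        emb = self.char_emb(char_ids.view(b * s, w)).transpose(1, 2)
        conv = torch.relu(self.char_cnn(emb))
        return torch.max(conv, dim=2).values.view(b, s, -1)  # max-pool over characters

    def emissions(self, word_ids, char_ids):
        feats = torch.cat([self.word_emb(word_ids),
                           self._char_features(char_ids)], dim=-1)
        out, _ = self.bilstm(self.dropout(feats))
        return self.hidden2tag(out)

    def loss(self, word_ids, char_ids, tag_ids, mask):
        return -self.crf(self.emissions(word_ids, char_ids), tag_ids, mask=mask)

    def decode(self, word_ids, char_ids, mask):
        return self.crf.decode(self.emissions(word_ids, char_ids), mask=mask)
```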
step 6, sequentially and respectively sending the multiple groups of training data obtained in the step 4 into a BLSTM-CNN-CRF-based deep learning model, training the BLSTM-CNN-CRF-based deep learning model, and taking the BLSTM-CNN-CRF-based deep learning model with the best prediction effect as a BLSTM-CNN-CRF-based deep learning model after training;
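An illustrative training loop for step 6 is sketched below, reusing the model sketch above; the evaluation function, the number of epochs and the early-selection criterion are assumptions, while the SGD optimizer and the idea of keeping the best-performing parameters follow the text.

```python
import copy
import torch

def train(model, batches, dev_batches, evaluate_f1, epochs=50, lr=0.015):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    best_f1, best_state = -1.0, None
    for epoch in range(epochs):
        model.train()
        for word_ids, char_ids, tag_ids, mask in batches:   # groups of training data
            optimizer.zero_grad()
            loss = model.loss(word_ids, char_ids, tag_ids, mask)
            loss.backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():
            f1 = evaluate_f1(model, dev_batches)
        if f1 > best_f1:                                     # keep the best model
            best_f1, best_state = f1, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model
```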
and 7, taking the data set to be predicted as a subset, converting the subset data into a data format matched with the BLSTM-CNN-CRF-based deep learning model according to the same method as the steps 1-4, namely the data format identical to each group of training data, and then sending the converted subset into the trained BLSTM-CNN-CRF-based deep learning model for prediction to obtain the named entity identification and entity normalization result corresponding to the data set to be predicted.
The BiLSTM layer consists of a forward LSTM and a backward LSTM, each formed by combining a number of LSTM units, where the number of LSTM units in the forward LSTM and in the backward LSTM is equal to the total number of embedding matrices of individual words in the embedding matrix of the concatenated words input into the BiLSTM layer. In the forward LSTM, the positions of the embedding matrices of the individual words within the embedding matrix of the concatenated words are determined in left-to-right order; the embedding matrix x_t of the word at position t is passed into the t-th LSTM unit, the embedding matrix x_{t+1} of the word at position t+1 is passed into the (t+1)-th LSTM unit, and so on, where t denotes the position, counted from left to right, of a word's embedding matrix within the embedding matrix of the concatenated words. The output of the t-th LSTM unit is the hidden feature information of x_t; moreover, except for the 1st LSTM unit, the t-th LSTM unit receives as input not only the embedding matrix x_t of the word at position t but also the output of the (t-1)-th LSTM unit, i.e. the hidden feature information of x_{t-1}. Each LSTM unit in the forward LSTM makes three decisions about information: it judges how much information to discard, how much new information to add to the current state, and how much information to include in the final output; the final output is the hidden feature information of the embedding matrix of the word at the corresponding position, so the forward LSTM obtains the hidden feature information of the embedding matrix of the concatenated words from left to right. The backward LSTM obtains the hidden feature information of the embedding matrix of the concatenated words input into the BiLSTM layer from right to left in the same way as the forward LSTM, the only difference being that in the backward LSTM the positions of the embedding matrices of the individual words are determined in right-to-left order. The hidden feature information at position t obtained from the forward LSTM and from the backward LSTM is concatenated and taken as the hidden feature information at position t of the embedding matrix of the concatenated words, and the hidden feature information from position 1 to the last position is concatenated to obtain the complete hidden state sequence of the embedding matrix of the concatenated words. This complete hidden state sequence is then mapped to obtain a matrix P = (p_1, p_2, ..., p_n), where p_j represents the scores of the labels corresponding to the j-th word in the embedding matrix of the concatenated words, n is the number of words in the embedding matrix of the concatenated words, and j = 1, 2, …, n. The matrix P is input into the CRF layer, which, based on P, judges whether the labelling information of adjacent words in the group of training data is reasonable, selects an optimal path and finally obtains the named entity recognition result of each word in the embedding matrix of the concatenated words, and thereby the named entity recognition result of the embedding matrix of the concatenated words. After the named entity recognition results of the embedding matrices of the concatenated words corresponding to every group of training data have been obtained in this way, the results for all groups of training data corresponding to each training subset are combined as the named entity recognition result of that training subset, and the labels corresponding to all words in this result form the entity standardization result of the training subset. Each subset D_i (i = 1, 2, …, T) is then updated to a new subset using a feedback formula that combines, through a Hadamard product operation, the entity standardization result of the training subset corresponding to the i-th subset with the matrix U_i formed by the optimal paths selected in the CRF layer when the BLSTM-CNN-CRF based deep learning model processed all groups of training data of that training subset. The updated subsets obtained in this way are processed according to the methods of steps 1 to 4 to obtain several groups of training data, which are then input again into the BLSTM-CNN-CRF based deep learning model for processing; the updated drug named entity recognition and entity standardization prediction results corresponding to each subset are finally obtained, and the BLSTM-CNN-CRF based deep learning model with the best prediction effect at this point is saved as the trained BLSTM-CNN-CRF based deep learning model.
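The element-wise (Hadamard) product used in the feedback step above can be illustrated with a toy numpy example. The shapes, values and the pairing of the two operands are assumptions for illustration only; the patent's exact update formula appears only as an image in the original.

```python
import numpy as np

Y_i = np.array([[0.9, 0.1, 0.0],
                [0.2, 0.7, 0.1]])   # per-word label scores of a training subset (illustrative)
U_i = np.array([[1, 0, 0],
                [0, 1, 0]])         # optimal CRF path encoded as a 0/1 matrix (illustrative)

feedback = Y_i * U_i                # Hadamard product: element-wise multiplication
print(feedback)                     # [[0.9 0.  0. ] [0.  0.7 0. ]]
```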
Compared with the prior art, the invention has the following advantages: character features are extracted from the input data with a CNN, context-dependent word vectors are obtained with ELMo, and the embedding matrices of biomedically pre-trained words are input into the BLSTM-CNN-CRF based deep learning model; the output labels are fed back to update each other's task parameters, so that the drug named entity recognition task DNER and the entity standardization task DNEN support each other. The invention adopts a fully shared mode and shares the BiLSTM-CNN layers between tasks, which means that, except for the output layers set up for DNER and DNEN respectively, all parameters of the BLSTM-CNN-CRF based deep learning model are shared; this structure ensures that the model can capture the feature representations of the different tasks and feed them back to each other to generate the prediction sequence. The invention therefore achieves mutual support between the two tasks of drug named entity recognition and entity standardization, and the accuracy of recognizing entity names and entity boundaries is high; experimental results show that the method performs well on the DDI2011 and DDI2013 data sets.
Drawings
FIG. 1 is a comparison of the medical named entity recognition and entity normalization method of the present invention with other different methods.
Detailed Description
The invention is described in further detail below with reference to the accompanying examples.
Example (b): a medicine named entity recognition and entity normalization method comprises the following steps:
step 1, acquiring data from the DDI2011 and DDI2013 challenge corpora to construct a data set for training the deep learning model, and performing preliminary processing on this data set as follows: the data set is randomly divided into T equal subsets D_1, D_2, …, D_T, where T is an integer of 2 or more; four alphabets, namely word, character (char), label and feature, are established for each subset, and each alphabet stores {key: instance, value: index} pairs, where key represents a stored key, value represents a stored value, instance refers to a word and index refers to an index; the label is the symbol used to mark a word with the BIOES tagging scheme, in which B indicates that the word is at the beginning (Begin) of an entity, I indicates Inside, O indicates Outside, E indicates that the word is at the end (End) position of an entity, and S indicates that the word forms an entity by itself (Single);
step 2, establishing two lists for each subset based on its four alphabets, each list containing four columns of data: the four columns of the first list are, in order, [words, chars, labels, features], and the four columns of the second list are, in order, [words_Ids, char_Ids, label_Ids, features_Ids]; in the first list of any subset, words corresponds to the data in the word alphabet of the subset, chars corresponds to the data in the char alphabet of the subset, labels corresponds to the data in the label alphabet of the subset, and features corresponds to the data in the feature alphabet of the subset; in the second list of any subset, words_Ids records the position, in the corresponding word alphabet, of each item in the words column of the first list of the subset, char_Ids records the position, in the corresponding char alphabet, of each item in the chars column, label_Ids records the position, in the corresponding label alphabet, of each item in the labels column, and features_Ids records the position, in the corresponding feature alphabet, of each item in the features column;
step 3, performing data processing on each of the T subsets of step 1 to obtain T training subsets, as follows: all words in the four alphabets (word, char, label and feature) of a subset are traversed and an embedding matrix is established for each word in each alphabet; sentences in the PMC (PubMed Central) and PubMed biomedical corpora are pre-trained with the GloVe algorithm to obtain word vectors, and these word vectors are used to assign and update the embedding matrix of each word in the word alphabet, while the embedding matrices of the words in the char, label and feature alphabets are assigned and updated randomly; the updated embedding matrices of the words in the four alphabets of the subset are then used to update the corresponding data in the subset in reverse, giving the training subset corresponding to that subset;
step 4, extracting sentences from each training subset in batches, and randomly extracting BatchSize sentences from each batch until all sentences in the training subset are extracted, wherein if the number of the remaining sentences not extracted in the training subset is less than 10, all the remaining sentences are extracted as the last batch of the training subset, the BatchSize is equal to 10, and the corresponding information of each batch of sentences is obtained from the first list of the subset corresponding to the training subset, modifying each batch of sentences to make each batch of sentences be nested lists formed by [ words, chars, labels, features ], and then processing each batch of sentences obtained at this time in the following way to make one batch of sentences as a group of training data, specifically: arranging sentences in a certain batch in a descending order according to the sequence length of the sentences, then determining the length of the longest sentence in the batch, adding 0 to the last of other sentences with the length less than the length of the longest sentence in the batch to enable the lengths of the sentences to be equal to the length of the longest sentence in the batch, finally respectively determining the length of the longest word in each sentence in the batch, adding 0 to the last of other words with the length less than the length of the longest word in the sentence to enable the lengths of the other words to be equal to the length of the longest word in the sentence, and obtaining a batch of sentences which are processed at this time as a set of training data; therefore, each training subset correspondingly obtains a plurality of groups of training data;
step 5, constructing a deep learning model based on BLSTM-CNN-CRF, which comprises three layers in sequence: a character layer, a BiLSTM layer and a CRF layer; the character layer is provided with a convolutional neural network (CNN) and the Dropout algorithm and supports both pre-training and random initialization: it obtains the word-level word embedding matrices of all data input into it by pre-training or random initialization, obtains the character-level word embedding matrices and labels of all data through the convolutional neural network, and finally concatenates the word-level word embedding matrices, the character-level word embedding matrices and the labels, inputting the resulting embedding matrices of the concatenated words into the BiLSTM layer; the Dropout algorithm is used to prevent overfitting during the operation of the convolutional neural network; the BiLSTM layer trains on the embedding matrix of each word in the embedding matrices of the concatenated words input into it and predicts labels, captures the hidden contextual feature information of each of these word embedding matrices, obtains a hidden state sequence mapping matrix from this hidden feature information and inputs it into the CRF layer; the CRF layer obtains the corresponding medicine named entity recognition and entity normalization results from the hidden state sequence mapping matrix;
step 6, sequentially and respectively sending the multiple groups of training data obtained in the step 4 into a BLSTM-CNN-CRF-based deep learning model, training the BLSTM-CNN-CRF-based deep learning model, and taking the BLSTM-CNN-CRF-based deep learning model with the best prediction effect as a BLSTM-CNN-CRF-based deep learning model after training;
and 7, taking the data set to be predicted as a subset, converting the subset data into a data format matched with the BLSTM-CNN-CRF-based deep learning model according to the same method as the steps 1-4, namely the data format identical to each group of training data, and then sending the converted subset into the trained BLSTM-CNN-CRF-based deep learning model for prediction to obtain the named entity identification and entity normalization result corresponding to the data set to be predicted.
In this embodiment, the BiLSTM layer is composed of a forward LSTM and a backward LSTM, each formed by combining a number of LSTM units, where the number of LSTM units in the forward LSTM and in the backward LSTM is equal to the total number of embedding matrices of individual words in the embedding matrix of the concatenated words input into the BiLSTM layer. In the forward LSTM, the positions of the embedding matrices of the individual words within the embedding matrix of the concatenated words are determined in left-to-right order; the embedding matrix x_t of the word at position t is passed into the t-th LSTM unit, the embedding matrix x_{t+1} of the word at position t+1 is passed into the (t+1)-th LSTM unit, and so on, where t denotes the position, counted from left to right, of a word's embedding matrix within the embedding matrix of the concatenated words. The output of the t-th LSTM unit is the hidden feature information of x_t; moreover, except for the 1st LSTM unit, the t-th LSTM unit receives as input not only the embedding matrix x_t of the word at position t but also the output of the (t-1)-th LSTM unit, i.e. the hidden feature information of x_{t-1}. Each LSTM unit in the forward LSTM makes three decisions about information: it judges how much information to discard, how much new information to add to the current state, and how much information to include in the final output; the final output is the hidden feature information of the embedding matrix of the word at the corresponding position, so the forward LSTM obtains the hidden feature information of the embedding matrix of the concatenated words from left to right. The backward LSTM obtains the hidden feature information of the embedding matrix of the concatenated words input into the BiLSTM layer from right to left in the same way as the forward LSTM, the only difference being that in the backward LSTM the positions of the embedding matrices of the individual words are determined in right-to-left order. The hidden feature information at position t obtained from the forward LSTM and from the backward LSTM is concatenated and taken as the hidden feature information at position t of the embedding matrix of the concatenated words, and the hidden feature information from position 1 to the last position is concatenated to obtain the complete hidden state sequence of the embedding matrix of the concatenated words. This complete hidden state sequence is then mapped to obtain a matrix P = (p_1, p_2, ..., p_n), where p_j represents the scores of the labels corresponding to the j-th word in the embedding matrix of the concatenated words, n is the number of words in the embedding matrix of the concatenated words, and j = 1, 2, …, n. The matrix P is input into the CRF layer, which, based on P, judges whether the labelling information of adjacent words in the group of training data is reasonable, selects an optimal path and finally obtains the named entity recognition result of each word in the embedding matrix of the concatenated words, and thereby the named entity recognition result of the embedding matrix of the concatenated words. After the named entity recognition results of the embedding matrices of the concatenated words corresponding to every group of training data have been obtained in this way, the results for all groups of training data corresponding to each training subset are combined as the named entity recognition result of that training subset, and the labels corresponding to all words in this result form the entity standardization result of the training subset. Each subset D_i (i = 1, 2, …, T) is then updated to a new subset using a feedback formula that combines, through a Hadamard product operation, the entity standardization result of the training subset corresponding to the i-th subset with the matrix U_i formed by the optimal paths selected in the CRF layer when the BLSTM-CNN-CRF based deep learning model processed all groups of training data of that training subset. The updated subsets obtained in this way are processed according to the methods of steps 1 to 4 to obtain several groups of training data, which are then input again into the BLSTM-CNN-CRF based deep learning model for processing; the updated drug named entity recognition and entity standardization prediction results corresponding to each subset are finally obtained, and the BLSTM-CNN-CRF based deep learning model with the best prediction effect at this point is saved as the trained BLSTM-CNN-CRF based deep learning model.
To verify the performance of the present invention, we used PyTorch to implement the deep learning model and ran it on an Nvidia GTX 1080 GPU. In our experiments we adopted the DDI2011 challenge corpus from the drug-drug interaction task. The elements <sentence> and <entity> are extracted with the minidom module in Python to obtain the necessary text and entity information, a list is established, and the entities and texts are matched and labelled. All training data sets are then collected as training data and all test data sets as test data. Table 1 shows the distribution of documents, sentences and drugs in the training and test sets of DDI2011. In this corpus there is only one entity type, DRUG, so the text is labelled only with "B/I-DRUG" or "O".
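A sketch of this corpus-preparation step is given below. The file path and the attribute names follow the commonly documented DDI shared-task XML layout and should be treated as assumptions rather than a transcription of the patent's code.

```python
from xml.dom import minidom

def read_ddi_file(path):
    """Extract (sentence text, entity mentions) pairs from one DDI XML file."""
    doc = minidom.parse(path)
    records = []
    for sent in doc.getElementsByTagName("sentence"):
        text = sent.getAttribute("text")
        entities = [ent.getAttribute("text")
                    for ent in sent.getElementsByTagName("entity")]
        records.append((text, entities))
    return records

# records = read_ddi_file("DDI2011/train/some_document.xml")   # hypothetical path
```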
Table 1 training and test setup in DDI2011
To further evaluate the performance of the deep learning model, we used the data set in the drug name recognition and classification task of SemEval-2013, table 2 shows the number of annotated entities in the DDI2013 training set and test set. The dataset contains four entity types: drug, Brand, Group, Drug _ n. Where Drug denotes any chemical agent used for the treatment, cure, prevention or diagnosis of a disease that has been approved for human use, branch is characterized by a trade or Brand name, Group denotes any term specifying a chemical or pharmacological relationship between a Group of drugs herein, and Drug _ n type describes a chemical agent that has not been approved for human medical purposes.
TABLE 2 number of annotated entities in DDI2013
The invention uses GloVe, pre-trained on the PMC and PubMed biomedical corpora, to initialize the word embedding matrix, and obtains context-dependent word vectors with ELMo; the embedding matrix of characters is randomly initialized with uniform samples, with dim = 30. Table 3 shows the hyper-parameters used in our experiments: the dimensions of the character embedding matrix, the pre-trained word embedding matrix and the contextualized word embedding matrix are set to 30, 100 and 1024, respectively. The parameters are updated with mini-batch stochastic gradient descent (SGD) and a decreasing learning rate during training; the initial learning rate of the deep learning model is set to 0.015, the Dropout rate to 0.5 and the batch size to 10.
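The optimizer setup just described can be sketched as follows. The exact decay schedule is not given in the text, so the per-epoch decay factor used here is an assumption, and the linear layer merely stands in for the real model.

```python
import torch

model = torch.nn.Linear(10, 5)                         # stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.015)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: 1.0 / (1.0 + 0.05 * epoch))

for epoch in range(3):
    # ... one pass over the mini-batches of size 10 would go here ...
    scheduler.step()                                   # decrease the learning rate
    print(epoch, scheduler.get_last_lr())
```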
TABLE 3. hyper-parameters in our experiments
In our experiments we evaluate the performance of the deep learning model using precision, recall and F1, where precision is the proportion of predicted entities that are correct, recall is the proportion of the entities in the data set that are correctly predicted, and F1 is the harmonic mean of precision and recall:

precision = TP / (TP + FP)

recall = TP / (TP + FN)

F1 = 2 × precision × recall / (precision + recall)

where TP is the number of true positive samples, TN is the number of true negative samples, FP is the number of false positive samples, and FN is the number of false negative samples. We adopt two of the four evaluation criteria provided by the DDI2013 challenge: type matching (a labelled drug name is correct as long as it has some overlap with a gold drug name of the same class) and exact matching (a labelled drug name is correct only if both its boundary and its class match the gold drug name exactly).
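A small sketch of these metrics, computed directly from the TP, FP and FN counts defined above (the example counts are invented):

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

print(precision_recall_f1(tp=80, fp=20, fn=10))   # (0.8, 0.888..., 0.842...)
```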
The deep learning model is evaluated on DDI2011 and DDI2013, two representative biomedical corpora. Table 4 compares the performance of Multi-DTR with the work of other teams, and the influence of each part of the architecture on the experiments, such as different embedding layers, different optimization methods and the multi-task mutual feedback framework, is then analysed. The comparison shows that the deep learning model architecture performs well in the experiments.
We compare our results with those of other teams; to ensure fairness, the hyper-parameters of the deep learning model are configured according to the optimal parameters reported in the corresponding articles. As can be seen from Table 4, the earlier dictionary-based and rule-based methods, including those of Tsuruoka, Hettne and others, achieve reasonable results. Later deep learning models such as LASIGE combine a CRF with dictionary term lists collected from databases for DNER processing to recognize and classify entities; Zeng et al. discriminate drug entities with a BiLSTM-CRF structure without any external dictionary and obtain better experimental results; Yang et al. use a hierarchical recurrent network for cross-lingual transfer learning; and the model of Liu et al. combines word embeddings trained on biomedical texts with semantic features from three medical dictionaries, performing well on DDI2013. Compared with the Liu model, the accuracy of our model is 0.90% lower, but its recall is 6.23% higher and its F1 score is 2.43% higher.
TABLE 4 Experimental results in DDI2011 and DDI2013
When evaluating on the DDI2013 data set, Table 5 reports the exact evaluation of the deep learning model for recognizing each entity type in the DDI2013 corpus. The model performs well on type recognition, but because the Drug_n entity type accounts for a small proportion (< 4%) of the data set, the deep learning model tends to ignore the differences between this entity type and the others, so its recognition accuracy for Drug_n is lower than for the other entities.
TABLE 5 Experimental results for different entity types in DDI2013
The invention obtains richer feature information through the embedding matrices of pre-trained words, the character representation and the embedding matrices of context-dependent words. As shown in Table 6, to test the influence of differently represented input information on the model, the three kinds of embedding matrix information are combined in different ways and input into the deep learning model; the results show that concatenated representations are superior to any single representation, and the method using multiple representations achieves the best performance.
TABLE 6 comparison of the Properties of each representation
We compared different optimizers, including SGD, Adagrad, Adadelta, RMSProp and Adam. SGD avoids falling into a saddle point or a poor local optimum by randomly drawing fixed-size training samples to compute the gradient and update the parameters. Adagrad adapts the learning rate under a constraint and is suitable for processing sparse gradients, but may cause the gradient to vanish; Adadelta is an extension of Adagrad that simplifies the computation. RMSProp relies on a global learning rate and is suitable for handling non-stationary targets; Adam dynamically adjusts the learning rate of each parameter using first- and second-moment estimates of the gradient, but is prone to generalization and convergence problems. The experimental results are shown in FIG. 1, and SGD is clearly superior to the other optimizers.
The invention also evaluates the effectiveness of Dropout, with all other hyper-parameters of the deep learning model kept the same as in Table 3. As shown in Table 8, we observe that after Dropout is used the performance of the deep learning model on the DDI2011 and DDI2013 corpora improves slightly, demonstrating the effect of Dropout in reducing overfitting.
TABLE 8 comparison of Performance Using dropout layers
The invention also explores the effectiveness of the multi-task learning strategy. As can be seen from Table 9, jointly modelling DNER and DNEN with the two explicit feedback strategies clearly improves the performance of the model: on the one hand it benefits from the general representation of the two tasks provided by multi-task learning, and on the other hand the method converts the hierarchical tasks into a parallel multi-task setting while maintaining the mutual support between the tasks.
TABLE 9 Performance comparison with multitask learning
In summary, drug text mining is an important interdisciplinary field between computer science and biomedicine, and the medicine named entity recognition and entity standardization method of the invention performs well on the DDI2011 and DDI2013 data sets. Detailed analysis shows that the main gains of the deep learning model come from the characters shared among drug entities, the pre-trained word embedding matrices and the context-dependent word embedding matrix information, and the problem of conflicting entity boundaries and types is largely solved by the forward feedback between DNER and DNEN. The experiments show that the method achieves good performance without any drug dictionary or manually constructed features, establishing an efficient drug entity recognition system.

Claims (2)

1. A medicine named entity recognition and entity normalization method is characterized by comprising the following steps:
step 1, acquiring data from the DDI2011 and DDI2013 challenge corpora to construct a data set for training the deep learning model, and performing preliminary processing on this data set as follows: the data set is randomly divided into T equal subsets D_1, D_2, …, D_T, where T is an integer of 2 or more; four alphabets, namely word, character (char), label and feature, are established for each subset, and each alphabet stores {key: instance, value: index} pairs, where key represents a stored key, value represents a stored value, instance refers to a word and index refers to an index; the label is the symbol used to mark a word with the BIOES tagging scheme, in which B indicates that the word is at the beginning (Begin) of an entity, I indicates Inside, O indicates Outside, E indicates that the word is at the end (End) position of an entity, and S indicates that the word forms an entity by itself (Single);
step 2, establishing two lists for each subset based on its four alphabets, each list containing four columns of data: the four columns of the first list are, in order, [words, chars, labels, features], and the four columns of the second list are, in order, [words_Ids, char_Ids, label_Ids, features_Ids]; in the first list of any subset, words corresponds to the data in the word alphabet of the subset, chars corresponds to the data in the char alphabet of the subset, labels corresponds to the data in the label alphabet of the subset, and features corresponds to the data in the feature alphabet of the subset; in the second list of any subset, words_Ids records the position, in the corresponding word alphabet, of each item in the words column of the first list of the subset, char_Ids records the position, in the corresponding char alphabet, of each item in the chars column, label_Ids records the position, in the corresponding label alphabet, of each item in the labels column, and features_Ids records the position, in the corresponding feature alphabet, of each item in the features column;
step 3, performing data processing on each of the T subsets of step 1 to obtain T training subsets, as follows: all words in the four alphabets (word, char, label and feature) of a subset are traversed and an embedding matrix is established for each word in each alphabet; sentences in the PMC (PubMed Central) and PubMed biomedical corpora are pre-trained with the GloVe algorithm to obtain word vectors, and these word vectors are used to assign and update the embedding matrix of each word in the word alphabet, while the embedding matrices of the words in the char, label and feature alphabets are assigned and updated randomly; the updated embedding matrices of the words in the four alphabets of the subset are then used to update the corresponding data in the subset in reverse, giving the training subset corresponding to that subset;
step 4, extracting sentences in batches from each training subset, and extracting batchSize sentences in each batch randomly until all sentences in the training subset are extracted, wherein if the number of the unextracted remaining sentences in the training subset is less than 10, all the remaining sentences are extracted as the last batch of the training subset, the batchSize is equal to 10, the corresponding information of each batch of sentences is obtained from the first list of the subset corresponding to the training subset, each batch of sentences are modified to be nested lists formed by [ words, char, labels, features ], and then each batch of sentences obtained at this time are processed as follows, and then each batch of sentences are used as a group of training data, specifically: arranging sentences in a certain batch in a descending order according to the sequence length of the sentences, then determining the length of the longest sentence in the batch, adding 0 to the last of other sentences with the length less than the length of the longest sentence in the batch to enable the lengths of the sentences to be equal to the length of the longest sentence in the batch, finally respectively determining the length of the longest word in each sentence in the batch, adding 0 to the last of other words with the length less than the length of the longest word in the sentence to enable the lengths of the other words to be equal to the length of the longest word in the sentence, and obtaining a batch of sentences which are processed at this time as a set of training data; therefore, each training subset correspondingly obtains a plurality of groups of training data;
step 5, constructing a deep learning model based on BLSTM-CNN-CRF, which comprises three layers in sequence: a character layer, a BiLSTM layer and a CRF layer; the character layer is provided with a convolutional neural network (CNN) and the Dropout algorithm and supports both pre-training and random initialization: it obtains the word-level word embedding matrices of all data input into it by pre-training or random initialization, obtains the character-level word embedding matrices and labels of all data through the convolutional neural network, and finally concatenates the word-level word embedding matrices, the character-level word embedding matrices and the labels, inputting the resulting embedding matrices of the concatenated words into the BiLSTM layer; the Dropout algorithm is used to prevent overfitting during the operation of the convolutional neural network; the BiLSTM layer trains on the embedding matrix of each word in the embedding matrices of the concatenated words input into it and predicts labels, captures the hidden contextual feature information of each of these word embedding matrices, obtains a hidden state sequence mapping matrix from this hidden feature information and inputs it into the CRF layer; the CRF layer obtains the corresponding medicine named entity recognition and entity normalization results from the hidden state sequence mapping matrix;
step 6, sequentially and respectively sending the multiple groups of training data obtained in the step 4 into a BLSTM-CNN-CRF-based deep learning model, training the BLSTM-CNN-CRF-based deep learning model, and taking the BLSTM-CNN-CRF-based deep learning model with the best prediction effect as a BLSTM-CNN-CRF-based deep learning model after training;
and 7, taking the data set to be predicted as a subset, converting the subset data into a data format matched with the BLSTM-CNN-CRF-based deep learning model according to the same method as the steps 1-4, namely the data format identical to each group of training data, and then sending the converted subset into the trained BLSTM-CNN-CRF-based deep learning model for prediction to obtain the named entity identification and entity normalization result corresponding to the data set to be predicted.
2. The method of claim 1, wherein the BiLSTM layer is composed of a forward LSTM and a backward LSTM, each formed by combining a number of LSTM units, the number of LSTM units in the forward LSTM and in the backward LSTM being equal to the total number of embedding matrices of individual words in the embedding matrix of the concatenated words input into the BiLSTM layer; in the forward LSTM, the positions of the embedding matrices of the individual words within the embedding matrix of the concatenated words are determined in left-to-right order, the embedding matrix x_t of the word at position t is passed into the t-th LSTM unit, the embedding matrix x_{t+1} of the word at position t+1 is passed into the (t+1)-th LSTM unit, and so on, where t denotes the position, counted from left to right, of a word's embedding matrix within the embedding matrix of the concatenated words; the output of the t-th LSTM unit is the hidden feature information of x_t, and, except for the 1st LSTM unit, the t-th LSTM unit receives as input not only the embedding matrix x_t of the word at position t but also the output of the (t-1)-th LSTM unit, i.e. the hidden feature information of x_{t-1}; each LSTM unit in the forward LSTM makes three decisions about information: it judges how much information to discard, how much new information to add to the current state, and how much information to include in the final output; the final output is the hidden feature information of the embedding matrix of the word at the corresponding position, so the forward LSTM obtains the hidden feature information of the embedding matrix of the concatenated words from left to right; the backward LSTM obtains the hidden feature information of the embedding matrix of the concatenated words input into the BiLSTM layer from right to left in the same way as the forward LSTM, the only difference being that in the backward LSTM the positions of the embedding matrices of the individual words are determined in right-to-left order; the hidden feature information at position t obtained from the forward LSTM and from the backward LSTM is concatenated and taken as the hidden feature information at position t of the embedding matrix of the concatenated words, and the hidden feature information from position 1 to the last position is concatenated to obtain the complete hidden state sequence of the embedding matrix of the concatenated words; this complete hidden state sequence is then mapped to obtain a matrix P = (p_1, p_2, ..., p_n), where p_j represents the scores of the labels corresponding to the j-th word in the embedding matrix of the concatenated words, n represents the number of words in the embedding matrix of the concatenated words and j = 1, 2, …, n; the matrix P is then input into the CRF layer, which, based on P, judges whether the labelling information of adjacent words in the group of training data is reasonable, selects an optimal path and finally obtains the named entity recognition result of each word in the embedding matrix of the concatenated words, and thereby the named entity recognition result of the embedding matrix of the concatenated words;
after the named entity recognition results of the embedding matrices of the concatenated words corresponding to every group of training data have been obtained in this way, the results for all groups of training data corresponding to each training subset are combined as the named entity recognition result of that training subset, and the labels corresponding to all words in this result form the entity standardization result of the training subset; each subset D_i (i = 1, 2, …, T) is then updated to a new subset using a feedback formula that combines, through a Hadamard product operation, the entity standardization result of the training subset corresponding to the i-th subset with the matrix U_i formed by the optimal paths selected in the CRF layer when the BLSTM-CNN-CRF based deep learning model processed all groups of training data of that training subset; the updated subsets obtained in this way are processed according to the methods of steps 1 to 4 to obtain several groups of training data, which are input again into the BLSTM-CNN-CRF based deep learning model for processing; the updated medicine named entity recognition and entity normalization prediction results corresponding to each subset are finally obtained, and the BLSTM-CNN-CRF based deep learning model with the best prediction effect at this point is saved as the trained BLSTM-CNN-CRF based deep learning model.
CN202210017353.7A 2021-08-25 2022-01-07 Medicine named entity recognition and entity standardization method Pending CN114519355A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2021109814541 2021-08-25
CN202110981454 2021-08-25

Publications (1)

Publication Number Publication Date
CN114519355A true CN114519355A (en) 2022-05-20

Family

ID=81596329

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210017353.7A Pending CN114519355A (en) 2021-08-25 2022-01-07 Medicine named entity recognition and entity standardization method

Country Status (1)

Country Link
CN (1) CN114519355A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117316372A (en) * 2023-11-30 2023-12-29 天津大学 Ear disease electronic medical record analysis method based on deep learning
CN117316372B (en) * 2023-11-30 2024-04-09 天津大学 Ear disease electronic medical record analysis method based on deep learning

Similar Documents

Publication Publication Date Title
CN110210037B (en) Syndrome-oriented medical field category detection method
CN109446338B (en) Neural network-based drug disease relation classification method
Quan et al. Multichannel convolutional neural network for biological relation extraction
Maghari et al. Books’ rating prediction using just neural network
CN111554360A (en) Drug relocation prediction method based on biomedical literature and domain knowledge data
Dheeraj et al. Negative emotions detection on online mental-health related patients texts using the deep learning with MHA-BCNN model
Hossain et al. Bengali text document categorization based on very deep convolution neural network
Dewi et al. Drug-drug interaction relation extraction with deep convolutional neural networks
CN111950283B (en) Chinese word segmentation and named entity recognition system for large-scale medical text mining
Akhtyamova et al. Adverse drug extraction in twitter data using convolutional neural network
CN115019906B (en) Drug entity and interaction combined extraction method for multi-task sequence labeling
CN112347766A (en) Multi-label classification method for processing microblog text cognition distortion
CN111222318A (en) Trigger word recognition method based on two-channel bidirectional LSTM-CRF network
CN112149411B (en) Method for constructing body in clinical application field of antibiotics
CN111581974A (en) Biomedical entity identification method based on deep learning
CN115293161A (en) Reasonable medicine taking system and method based on natural language processing and medicine knowledge graph
Schäfer et al. UMLS mapping and Word embeddings for ICD code assignment using the MIMIC-III intensive care database
Wang et al. Automatic human-like mining and constructing reliable genetic association database with deep reinforcement learning
CN115713078A (en) Knowledge graph construction method and device, storage medium and electronic equipment
Mechti et al. A decision system for computational authors profiling: From machine learning to deep learning
CN114519355A (en) Medicine named entity recognition and entity standardization method
CN113539414A (en) Method and system for predicting rationality of antibiotic medication
Machado et al. Drug–drug interaction extraction‐based system: An natural language processing approach
Siddalingappa et al. Bi-directional long short term memory using recurrent neural network for biological entity recognition
Tran et al. Exploring a deep learning pipeline for the BioCreative VI precision medicine task

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination