CN111126040A

CN111126040A - Biomedical named entity identification method based on depth boundary combination

Info

Publication number: CN111126040A
Application number: CN201911362019.XA
Authority: CN
Inventors: 黄瑞章; 扈应; 秦永彬; 武乐飞; 陈艳平
Original assignee: Guizhou University
Current assignee: Guizhou University
Priority date: 2019-12-26
Filing date: 2019-12-26
Publication date: 2020-05-08
Anticipated expiration: 2039-12-26
Also published as: CN111126040B

Abstract

The invention discloses a biomedical named entity recognition method based on depth boundary combination, which comprises the following steps: step one, modeling discontinuous entities among biomedical entities as a nested entity structure; step two, representing biomedical vocabulary information using character-level embeddings and word-level embeddings; step three, recognizing the boundaries of biomedical entities with a neural network model, based on the word vectors obtained in step two; step four, generating a candidate entity set using a boundary combination strategy; and step five, constructing a neural network classifier to screen the candidate entity set. In view of the characteristics of biomedical named entities, the invention adopts a framework based on depth boundary combination together with available external resources, represents biomedical words more accurately, addresses the problem of recognizing discontinuous entities in biomedical text, and completes the BioNER task. It provides stronger theoretical and technical support for BioNER, offers researchers in the biomedical field a convenient and efficient entity recognition tool, and effectively improves the performance of biomedical entity recognition.

Description

Biomedical named entity identification method based on depth boundary combination

Technical Field

The invention relates to a biomedical named entity recognition method, in particular to a biomedical named entity recognition method based on depth boundary combination, and belongs to the technical field of natural language processing and machine learning.

Background

At present, biomedical research that enables timely and effective prevention, modification or treatment of disease attracts wide attention, and its social and commercial value is increasingly prominent. Many such studies require large investment and long research cycles, and efficient retrieval of biomedical literature is an important means of keeping them on schedule. However, a large amount of biomedical knowledge is stored in databases as unstructured text. According to statistics, the PubMed central literature database contains over 29 million literature citations, covering almost all knowledge in the biomedical domain. Even when focusing on a single highly specialized research field, most biologists have difficulty keeping up with its progress. Accurately extracting knowledge from a large body of literature has therefore become crucial. Biomedical text mining promises to achieve this goal and, in some cases, can also reduce costs, provide timely access to needed knowledge, and discover explicit and implicit associations between pieces of knowledge.

Biomedical information extraction provides a content-oriented approach to processing biomedical documents, rather than simply ranking relevant documents by document similarity. Biomedical named entity recognition (BioNER) is one of the basic tasks of biomedical text mining. It aims to recognize text spans that refer to specific entities of interest, and it plays a key role in tasks such as disease-treatment relation extraction and gene function recognition. Named entity recognition in the usual sense recognizes names of persons, places, organizations and the like in text; in the biomedical field, however, entities such as DNA, RNA and proteins are of greater interest to biologists. BioNER is the first step in processing biomedical documents, and errors made at this stage cascade into subsequent tasks such as relation recognition and event recognition. Given the important linguistic and semantic role of BioNER, recognizing and classifying biomedical entities more effectively has great theoretical significance and practical value for biomedical research.

Compared with named entities in the general domain, biomedical named entities (BioNEs) have the following characteristics: (1) BioNEs often carry many pre-modifiers, as in major histocompatibility complex (MHC) class II genes (DNA), which makes entity length highly variable and entity boundaries hard to determine. (2) Conjunctions and disjunctions are common in BioNEs; that is, two or more entity names joined by a conjunction or disjunction share the same prefix or suffix noun. For example, the phrase "human T and natural killer cells" contains two named entities, human T cells (cell_type) and human natural killer cells (cell_type). (3) Entity nesting is widespread. For example, within the entity Duffy antigen/chemokine receptor gene (DNA), Duffy antigen/chemokine receptor is itself a protein that needs to be recognized. (4) BioNEs include many acronyms. These may be ambiguous, which hinders a neural network model from capturing their semantics. For example, "TCF" may refer to either T-cell factor or tissue culture fluid. Such entities are also hard to identify from existing dictionaries, and context is needed to infer the entity type accurately. (5) The biomedical literature has no strict naming convention, so the same entity may have different representations. For example, Cholesterol, 5-Cholesten-3beta-ol and (3beta)-Cholesten-5-en-3-ol all denote the same chemical substance. Much work has applied existing general-domain named entity recognition methods directly to the biomedical domain; however, because of the characteristics listed above, these methods rarely achieve satisfactory results, and biomedical named entity recognition (BioNER) remains a challenging problem. The present invention therefore studies methods for BioNER.

The named entity recognition (NER) task is generally treated as a sequence tagging problem, in which each word in a sentence is assigned a tag (Beginning of an entity (B), Inside of an entity (I), Outside of any entity (O)) that represents its semantic role. Over years of development, BioNER has gone through three major stages: dictionary-based methods, rule-based methods, and machine-learning-based methods.
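The B/I/O tagging scheme described above can be illustrated with a minimal sketch. The sentence, the entity span and the function name `bio_tags` are hypothetical and chosen only for illustration; spans are given as token indices with an exclusive end.

```python
def bio_tags(tokens, entities):
    """Assign B/I/O tags to tokens.

    entities: list of (start, end) token spans, end exclusive, non-overlapping.
    """
    tags = ["O"] * len(tokens)
    for start, end in entities:
        tags[start] = "B"              # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I"              # remaining tokens of the entity
    return tags


tokens = ["The", "human", "T", "cell", "line", "was", "activated"]
# Hypothetical annotation: tokens 1..3 ("human T cell") form one entity.
tags = bio_tags(tokens, [(1, 4)])
print(list(zip(tokens, tags)))
```

Because every token receives exactly one tag, a single flat tag sequence of this kind cannot mark two entities that overlap, which is why nested entities need a different treatment.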

The dictionary-based approach stores all known named entities in a database and uses it to perform simple exact (or fuzzy) matching against the text. However, given the rapid growth of the biomedical literature, it is not possible to build a dictionary that contains the entities of every category. The rule-based approach matches named entities through manually designed heuristic rules. Budi et al. use rules composed of grammatical (e.g., part of speech), syntactic and orthographic (e.g., case) patterns for named entity recognition. Fukuda et al. use rules based on case, symbols, digits and the like to extract proteins. Etzioni et al. propose a semi-supervised framework that divides named entity recognition into three steps: pattern learning, subclass extraction and list extraction; the framework automatically generates new extraction rules to complete the named entity recognition task. However, formulating such rules requires a great deal of manpower and material resources. The machine-learning-based approach has the advantage of learning decision boundaries automatically from annotated data, and it is widely used to solve the NER problem. Typically, NER is treated as a multi-class classification task or a sequence labeling task. Many supervised algorithms have been applied to NER, such as Decision Trees (DT), Maximum Entropy (ME), Support Vector Machines (SVM), Hidden Markov Models (HMM) and Conditional Random Fields (CRF). With machine-learning-based algorithms, researchers do not have to write complex rules by hand. In addition, these algorithms can recognize new named entities and categories that do not appear in a standard dictionary.

In recent years, with the development of neural networks, natural language processing (NLP) tasks have gained greater potential, and deep neural networks have been applied to a wide range of NLP tasks with great success. Compared with traditional machine learning methods based on manually constructed features, a neural network can automatically extract high-order abstract features from the raw input. It can also combine different layers (such as convolutional layers, recurrent layers, pooling layers and fully connected layers) to implement complex nonlinear feature transformations. Many neural network models have been applied to the NER task, such as convolutional neural networks (CNN), long short-term memory networks (LSTM), LSTM-CNNs and LSTM-CRF. Gridach et al. combine deep neural networks with a CRF, word embedding representations and character-level word representations on biomedical datasets (the JNLPBA corpus and the BioCreAtIvE II Gene Mention (GM) corpus) and show good performance in BioNER. However, these methods can hardly recognize the nested entities that occur widely in BioNEs, which is a significant obstacle to improving the performance of BioNER tasks.

There are fewer studies on recognizing named entities with nested structures than on recognizing flat entities. The earliest study on nested NEs is that of Alex et al., who compared three classical nested named entity recognition methods: layering, cascading and joint labeling. On the same dataset (the GENIA corpus), Finkel et al. use a parse tree to identify nested NEs. In their model, rules are used to attach entity candidates to the parse tree, and a CRF model is then applied on the tree to output the most probable tag sequence. Chen et al. use a cascading framework to identify nested named entities, which is divided into three steps: boundary detection, boundary combination and entity screening. In their model, a CRF model detects entity boundaries, and after the entity candidate set has been built, a maximum entropy model selects the positive entities. Lu et al. devise a mention hypergraph method to identify nested named entities. The hypergraph is a compact representation of all possible combinations of entities; based on this representation, each sub-hypergraph is labeled using a log-linear method to identify nested NEs. Because this model requires a large number of manually defined features, Muis et al. build on it and propose a neural-network-based hypergraph model for nested NE recognition. Ju et al. identify nested entities by building each flat NER layer on the output of the previous LSTM layer; the model dynamically stacks flat NER layers until no outer entities are extracted. Although BioNER has been studied extensively, there is still much room for improvement in its performance.

Disclosure of Invention

The technical problem to be solved by the invention is the poor recognition performance for biomedical entities. To this end, discontinuous entities in biomedical text are first modeled as a nested structure; a boundary detection classifier is built with a neural network model to recognize the starting and ending boundaries of entities; a candidate entity set is then generated through a boundary combination strategy; and finally a classifier is trained to screen the candidate named entities.

The technical scheme of the invention is as follows: a biomedical named entity recognition method based on depth boundary combination, comprising the following steps: step one, modeling discontinuous entities among biomedical entities as a nested entity structure; step two, representing biomedical vocabulary information using character-level embeddings and word-level embeddings; step three, recognizing the boundaries of biomedical entities with a neural network model, based on the word vectors obtained in step two; step four, generating a candidate entity set using a boundary combination strategy; and step five, constructing a neural network classifier to screen the candidate entity set.

In the third step, the neural network model is a Bi-LSTM + CRF model.

In the fourth step, the boundary combination strategy is a greedy matching strategy.

In the fifth step, the sentence is divided into four parts centered on the candidate entity: the part to the left of the entity, the entity in forward order, the entity in reverse order, and the part to the right of the entity. These four parts are fed into a neural network as four channels; a convolutional neural network model further mines latent local semantic information, and a fully connected layer then captures the global information of the sentence, completing the recognition of the named entity.

The beneficial effects of the invention are as follows: compared with the prior art, the technical scheme of the invention targets the characteristics of biomedical named entities, adopts a framework based on depth boundary combination together with available external resources, represents biomedical words more accurately, addresses the problem of recognizing discontinuous entities in biomedical text, and completes the BioNER task. It provides stronger theoretical and technical support for BioNER, offers researchers in the biomedical field a convenient and efficient entity recognition tool, and effectively improves the performance of biomedical entity recognition.

The depth boundary combination framework has the following main advantages: (1) Entity boundaries are fine-grained and do not depend on any other NLP task. Boundary information is unambiguous and easier to recognize than whole NEs. (2) The framework is highly flexible. It is a cascade framework, so different models can be used for boundary detection, boundary combination and entity screening. (3) External resources can be used effectively. Pre-trained word embeddings can be obtained from large-scale raw data, which helps the neural network model capture semantic information, so external resources can be used to improve experimental performance.

Drawings

FIG. 1 is an exemplary diagram of nested entities and discontinuous entities according to the present invention;

FIG. 2 is a diagram of a rule-optimized boundary detection model according to the present invention;

FIG. 3 is a diagram of a depth boundary combinatorial model according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings.

Example 1: as shown in FIG. 1 to FIG. 3, a biomedical named entity recognition method based on depth boundary combination comprises the following steps: step one, modeling discontinuous entities among biomedical entities as a nested entity structure; step two, representing biomedical vocabulary information using character-level embeddings and word-level embeddings; step three, recognizing the boundaries of biomedical entities with a neural network model, based on the word vectors obtained in step two; step four, generating a candidate entity set using a boundary combination strategy; and step five, constructing a neural network classifier to screen the candidate entity set.

Step one is intended to model the representation of discontinuous entities. An example of a discontinuous entity is shown in FIG. 1. Discontinuous entities in biomedical text are difficult to represent, and almost all related studies skip the recognition of discontinuous entities because they are hard to model. The present invention converts discontinuous entities into a nested structure. For example, the phrase "HEL, KU812 and K562 cells" contains three BioNEs: "HEL cells", "KU812 cells" and "K562 cells". Using this representation, the example can be converted into three nested named entities: "HEL, KU812 and K562 cells", "KU812 and K562 cells" and "K562 cells".
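The conversion in step one can be sketched as follows, under one reading of the text: each entity that shares the head noun "cells" is represented by the contiguous span running from its own modifier to the shared head. The function name `nested_spans`, the tokenization and the modifier positions are assumptions made for illustration and are not specified in the original.

```python
def nested_spans(modifier_starts, head_end):
    """Map entities that share a head noun to nested contiguous spans.

    modifier_starts: token index where each entity's own modifier begins.
    head_end: index one past the shared head noun (exclusive end).
    """
    return [(start, head_end) for start in sorted(modifier_starts)]


tokens = ["HEL", ",", "KU812", "and", "K562", "cells"]
# Assumed annotation: the modifiers start at tokens 0, 2 and 4.
spans = nested_spans(modifier_starts=[0, 2, 4], head_end=len(tokens))
for start, end in spans:
    print((start, end), " ".join(tokens[start:end]))
```

All three spans end at the same token, so they nest inside one another, and a model that handles nested entities can recover them without a dedicated representation for discontinuous mentions.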

In the second step, in view of the characteristics of biomedical vocabulary, more accurate word vectors are used to express the semantic and syntactic information of biomedical words, so that the biomedical text mining task can be carried out effectively. The invention concatenates a character-level embedding vector and a word-level embedding vector to better represent the semantic information of biomedical vocabulary. The character-level embedding vector is generated by training a Bi-LSTM over the characters of each word, and the word-level embedding vector is the GloVe vector trained by Stanford University on a corpus of 6 billion words.

In the third step, the neural network model is a Bi-LSTM + CRF model, which is built to recognize the boundaries of biomedical entities in a sentence. Based on the characteristics of entity boundaries, a general, accurate and unified representation of entity boundaries is sought, and rules that reflect the characteristics of entities in the biomedical field are added to the neural network model to optimize boundary detection performance. In this way, entity boundary information is preserved as far as possible while the raw corpus is converted into high-level features, and boundary semantic information is extracted efficiently and fully utilized.

In step four, the boundary combination strategy is a greedy matching strategy. On the basis of entity boundary recognition, a boundary combination strategy is applied that converts an entity structure containing multiple nested layers into flat entity structures that are independent of one another, so that the nested or discontinuous entities contained in a sentence are represented accurately. The uniformly represented entity boundaries are combined in a suitable way to generate candidate entities, so as to find the nested and discontinuous entities they contain.

In step five, on the basis of the combined boundary information, models such as convolutional neural networks and LSTMs are used to screen out the correct entities, with precision (P), recall (R) and the F1 score as performance metrics. The sentence is divided into four parts centered on the candidate entity: the part to the left of the entity, the entity in forward order, the entity in reverse order, and the part to the right of the entity. These four parts are fed into a neural network as four channels; a convolutional neural network model further mines latent local semantic information, and a fully connected layer then captures the global information of the sentence, completing the recognition of the named entity.
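The three metrics named above can be computed as in the following sketch. It assumes exact-match evaluation, in which a predicted entity counts as correct only if its span and type both match a gold entity; the text does not state the matching criterion. The spans and the type label `cell_line` are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall and F1 over sets of entities."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    p = true_positives / len(predicted) if predicted else 0.0
    r = true_positives / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Hypothetical entities as (start, end, type) triples.
gold = {(0, 6, "cell_line"), (2, 6, "cell_line"), (4, 6, "cell_line")}
predicted = {(0, 6, "cell_line"), (4, 6, "cell_line"), (3, 6, "cell_line")}
p, r, f1 = precision_recall_f1(predicted, gold)
print(round(p, 3), round(r, 3), round(f1, 3))
```

Here two of the three predictions are correct and two of the three gold entities are found, so all three scores equal 2/3.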

The present invention will be further described with reference to the following examples:

To carry out the method of the invention, step one is performed first: the discontinuous entities present among biomedical entities are modeled as nested structures. Discontinuous entities in biomedical text are difficult to represent, and almost all related studies skip the recognition of discontinuous entities because they are hard to model. The present invention converts discontinuous entities into a nested structure. For example, the phrase "HEL, KU812 and K562 cells" contains three BioNEs: "HEL cells", "KU812 cells" and "K562 cells". Using this representation, the example can be converted into three nested named entities: "HEL, KU812 and K562 cells", "KU812 and K562 cells" and "K562 cells".

Next, step two is performed to obtain the semantic information of biomedical vocabulary. The invention concatenates a character-level embedding vector and a word-level embedding vector to represent each biomedical word. The word-level embedding vector is produced by a lookup table, which may be initialized randomly or with pre-trained values. In the present invention, it is initialized with the GloVe word vectors trained by Stanford University on a corpus of 6 billion words. The character-level embedding vector is trained with a Bi-LSTM model: each letter of a word (with a fixed length of 20 letters per word) is mapped to a 30-dimensional random vector, the sequence is processed by the Bi-LSTM model, and the output of the model is taken as the character-level vector representation of the word. Finally, the character-level embedding vector and the word-level embedding vector are concatenated to form the final word vector representation of the word.
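The data flow of step two can be sketched in plain Python as follows. This is only an illustration of the shapes involved: the Bi-LSTM is replaced by a simple averaging stand-in, the word table is randomly initialized rather than loaded from GloVe, and the word-vector size `WORD_DIM`, the padding symbol and the lowercasing of words are assumptions not stated in the text.

```python
import random

MAX_CHARS = 20   # fixed word length in characters (from the text)
CHAR_DIM = 30    # per-character vector size (from the text)
WORD_DIM = 100   # assumption: the word-vector size is not stated in the text

rng = random.Random(0)
char_table = {}  # character -> random 30-dimensional vector
word_table = {}  # word -> vector; randomly initialized here instead of GloVe


def lookup(table, key, dim):
    """Return the vector for key, creating a random one on first use."""
    if key not in table:
        table[key] = [rng.uniform(-0.25, 0.25) for _ in range(dim)]
    return table[key]


def char_matrix(word):
    """Truncate or pad the word to MAX_CHARS and look up each character."""
    chars = list(word[:MAX_CHARS]) + ["<pad>"] * max(0, MAX_CHARS - len(word))
    return [lookup(char_table, c, CHAR_DIM) for c in chars]


def char_encoder_stand_in(matrix):
    """Placeholder for the Bi-LSTM: averages the character vectors."""
    return [sum(column) / len(matrix) for column in zip(*matrix)]


def word_representation(word):
    """Concatenate the character-level and word-level vectors of a word."""
    char_vec = char_encoder_stand_in(char_matrix(word))
    word_vec = lookup(word_table, word.lower(), WORD_DIM)
    return char_vec + word_vec  # list concatenation plays the role of splicing


vec = word_representation("KU812")
print(len(char_matrix("KU812")), len(vec))
```

In a real implementation the averaging function would be replaced by a trained Bi-LSTM, and the size of the character-level vector would then be determined by the hidden size of that model.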

After the vector representation of the biomedical vocabulary has been obtained, step three is performed: a Bi-LSTM + CRF + rule model is built to detect entity boundaries. The model framework is shown in FIG. 3 (boundary classifier). The boundary detection model used in the invention is a classical Bi-LSTM + CRF structure. The vector representation of the biomedical vocabulary obtained in step two is fed into the neural network model, followed by a fully connected layer and a CRF layer, which outputs the tag sequence with the maximum probability. In addition, the invention introduces rules from the biomedical field into the boundary detection model in two ways. In the first way, after the model has produced its output, the possible entity boundaries are screened using a series of rules (e.g., words containing three or more consecutive capital letters, words containing a connecting symbol such as "-" or "/", and words with an affix such as "DNA" or "RNA"). In the second way, a series of rules from the biomedical field is mapped into a lookup table, a rule vector is generated for each word through the lookup table, the rule vector is concatenated with the word vector to produce a word vector of larger dimension, and this vector is fed into the model to complete the detection of entity boundaries.
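The three example rules mentioned above can be expressed as simple pattern checks. The sketch below turns them into a binary feature vector per word, which is a simplification of the lookup-table rule vector described in the text; the rule names, the regular expressions and the test words are assumptions made for illustration.

```python
import re

# Each rule is a (name, pattern) pair; a word fires a rule if the pattern matches.
RULES = [
    ("three_caps", re.compile(r"[A-Z]{3,}")),                  # 3+ consecutive capitals
    ("connector", re.compile(r"[-/]")),                        # contains "-" or "/"
    ("dna_rna_affix", re.compile(r"^(DNA|RNA)|(DNA|RNA)$")),   # starts or ends with DNA/RNA
]


def rule_vector(word):
    """Return one 0/1 feature per rule for the given word."""
    return [1 if pattern.search(word) else 0 for _, pattern in RULES]


for word in ["NF-kappaB", "mRNA", "MHC", "cells"]:
    print(word, rule_vector(word))
```

Such a vector could either be appended to the word vector before it enters the model (the second way) or be used after decoding to filter predicted boundaries (the first way).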

Next, step four is performed: a candidate entity set is generated using a boundary combination strategy. The invention uses a greedy matching strategy: for each ending boundary, the top n (n = 1, 2, 3, …) possible starting boundaries to its left are matched with it. Through this strategy, the possible flat entities, nested entities and discontinuous entities (modeled as nested structures) in the sentence are found, and a candidate entity set is generated.
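A minimal sketch of the greedy matching strategy follows. It assumes that "top n" means the n starting boundaries nearest to each ending boundary on its left; the text could also be read as ranking by classifier score. The function name `combine_boundaries` and the boundary positions are hypothetical, and both indices of a span are inclusive.

```python
def combine_boundaries(starts, ends, n=2):
    """Pair each ending boundary with its n nearest starting boundaries on the left.

    starts, ends: token indices predicted as starting and ending boundaries.
    Returns candidate spans as (start, end) pairs, both inclusive.
    """
    candidates = []
    for end in sorted(ends):
        # Starting boundaries at or before this end, nearest first.
        left_starts = [s for s in sorted(starts, reverse=True) if s <= end]
        for start in left_starts[:n]:
            candidates.append((start, end))
    return candidates


tokens = ["HEL", ",", "KU812", "and", "K562", "cells"]
starts = [0, 2, 4]   # assumed output of the boundary detector
ends = [5]
candidates = combine_boundaries(starts, ends, n=3)
for start, end in candidates:
    print((start, end), " ".join(tokens[start:end + 1]))
```

A larger n raises recall of nested candidates at the cost of more false candidates, which the classifier in step five then has to filter out.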

Next, step five is performed: an entity classifier is built with a neural network model to screen the correct entities in the candidate entity set and filter out the erroneous ones. Many models can be used in this process, such as a convolutional neural network (CNN), a long short-term memory network (LSTM), a conditional random field (CRF), a maximum entropy model or a support vector machine (SVM); the present invention uses a convolutional neural network (CNN). The input to this step is a sentence containing labeled candidate entities, each candidate entity having a label that indicates whether it is a correct entity. Thus, the input may be represented as a set in which each element is a candidate entity in a sentence S_k, formed by the span from a starting position to an ending position in that sentence, together with its label L_k.

In brief, this step takes a sentence containing a labeled candidate entity as input and trains a classifier to decide whether the current candidate is a correct entity. The specific procedure is as follows. The sentence is divided into four channels with the entity as the boundary: the part to the left of the entity, the entity in forward order, the entity in reverse order, and the part to the right of the entity. The length of each channel is fixed at 80. Each channel is processed by a neural network consisting of an embedding layer, a convolutional layer and a max pooling layer. In the embedding layer, the BERT model maps each token of a channel to a 768-dimensional word vector. The convolutional layer and the max pooling layer then produce vectors representing high-order abstract features, which are passed to the fully connected layer; finally, a softmax activation function outputs a one-hot vector representing the category.
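The four-channel split described above can be sketched as follows. Only the token-level preparation is shown; the embedding, convolution and pooling layers are omitted. The padding token, the choice to keep the left-context tokens nearest the entity when truncating, and the example sentence are assumptions not given in the text.

```python
CHANNEL_LEN = 80  # fixed channel length (from the text)
PAD = "[PAD]"     # assumption: the padding token is not named in the text


def fit(tokens, length=CHANNEL_LEN, keep="head"):
    """Truncate to `length` (keeping the head or the tail) and right-pad."""
    clipped = tokens[:length] if keep == "head" else tokens[-length:]
    return clipped + [PAD] * (length - len(clipped))


def four_channels(tokens, start, end):
    """Split a sentence around the candidate entity tokens[start:end]."""
    entity = tokens[start:end]
    return {
        "left": fit(tokens[:start], keep="tail"),  # keep tokens nearest the entity
        "entity_forward": fit(entity),
        "entity_reverse": fit(entity[::-1]),
        "right": fit(tokens[end:]),
    }


# Hypothetical sentence; the candidate entity is "K562 cells" (tokens 6..7).
sentence = "Expression in HEL , KU812 and K562 cells was measured".split()
channels = four_channels(sentence, start=6, end=8)
for name, channel in channels.items():
    print(name, [t for t in channel if t != PAD])
```

Each of the four fixed-length token lists would then be embedded and passed through its own convolution and max pooling branch before the branches are merged in the fully connected layer.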

Finally, the invention verifies its effectiveness on a real dataset, the GENIA dataset. The GENIA corpus was built by the GENIA project for developing and evaluating information retrieval and text mining systems for molecular biology. The dataset is drawn from the biomedical literature: it contains a total of 2000 MEDLINE abstracts retrieved from PubMed using three medical subject heading terms, namely human, blood cells and transcription factors. The dataset contains 36 fine-grained entity classes and 94584 entities in total, of which 35.27% are nested or discontinuous. Table 1 shows the performance of recognizing entities in the GENIA dataset using the depth boundary combination method. The Layering method computes the performance of the innermost layer and the outermost layer separately and compares the recognition results of the two layers; it can recognize two layers of nested entities, but cannot capture the semantic information provided by different categories. The Cascading method recognizes one category of entity at a time based on an LSTM sequence model: 10 mutually independent models are built, and the overall performance is obtained from the 10 recognition results. Clearly, this method cannot take the relations among different categories into account and, to a certain extent, cannot recognize multi-layer nested entities.

Table 1: Performance for various entities on the GENIA dataset

To compare the present invention with related work, the experimental setup follows that of Lu et al. Table 2 gives an experimental comparison between the present invention and related work.

Table 2: Comparison of experimental performance

As can be seen from Table 1 and Table 2, the present invention effectively models the representation of discontinuous entities and can accurately recognize the discontinuous entities present in the biomedical literature. In addition, the method overcomes the shortcomings of traditional sequence labeling methods and recognizes nested entities more effectively.

Matters not described in detail in the present invention are well known to those skilled in the art. Finally, the above embodiments are intended only to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from their spirit and scope, and all such modifications and substitutions shall be covered by the claims of the present invention.

Claims

1. A biomedical named entity recognition method based on depth boundary combination, characterized in that the method comprises the following steps: step one, modeling discontinuous entities among biomedical entities as a nested entity structure; step two, representing biomedical vocabulary information using character-level embeddings and word-level embeddings; step three, recognizing the boundaries of biomedical entities with a neural network model, based on the word vectors obtained in step two; step four, generating a candidate entity set using a boundary combination strategy; and step five, constructing a neural network classifier to screen the candidate entity set.

2. The biomedical named entity recognition method based on depth-boundary combination according to claim 1, characterized in that: in the third step, the neural network model is a Bi-LSTM + CRF model.

3. The biomedical named entity recognition method based on depth-boundary combination according to claim 1, characterized in that: in the fourth step, the boundary combination strategy is a greedy matching strategy.

4. The biomedical named entity recognition method based on depth-boundary combination according to claim 1, characterized in that: in the fifth step, the sentence is divided into four parts centered on the candidate entity: the part to the left of the entity, the entity in forward order, the entity in reverse order, and the part to the right of the entity; these four parts are fed into a neural network as four channels, a convolutional neural network model further mines latent local semantic information, and a fully connected layer then captures the global information of the sentence, completing the recognition of the named entity.