CN111126040A - Biomedical named entity identification method based on depth boundary combination - Google Patents

Biomedical named entity identification method based on depth boundary combination

Info

Publication number
CN111126040A
Authority
CN
China
Prior art keywords
entity
biomedical
boundary
neural network
depth
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911362019.XA
Other languages
Chinese (zh)
Other versions
CN111126040B (en)
Inventor
黄瑞章
扈应
秦永彬
武乐飞
陈艳平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou University
Original Assignee
Guizhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou University
Priority to CN201911362019.XA
Publication of CN111126040A
Application granted
Publication of CN111126040B
Current legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a biomedical named entity recognition method based on depth boundary combination, which comprises the following steps: step one, modeling discontinuous entities among biomedical entities as a nested entity structure; step two, representing biomedical vocabulary information using character-level embedding and word-level embedding; step three, recognizing the boundaries of biomedical entities using a neural network model based on the word vectors obtained in step two; step four, generating a candidate entity set using a boundary combination strategy; and step five, constructing a neural network classifier to screen the candidate entity set. Targeting the characteristics of biomedical named entities, the invention adopts a depth boundary combination framework and draws on available external resources to represent biomedical words more accurately, addresses the recognition of discontinuous entities in biomedical text, and completes the BioNER task. It provides stronger theoretical and technical support for BioNER, offers researchers in the biomedical field a convenient and efficient entity recognition tool, and effectively improves the performance of biomedical entity recognition.

Description

Biomedical named entity identification method based on depth boundary combination
Technical Field
The invention relates to a biomedical named entity recognition method, in particular to a biomedical named entity recognition method based on depth boundary combination, and belongs to the technical field of natural language processing and machine learning.
Background
Biomedical research that enables timely and effective prevention, management, or treatment of disease currently attracts wide attention, and its social and commercial value is increasingly prominent. Much of this research requires large investment and long study periods, and efficient retrieval of biomedical literature is an important means of keeping it on track. However, a large amount of biomedical knowledge is stored in databases as unstructured text. The PubMed literature database alone contains more than 29 million citations, covering almost all biomedical domain knowledge. Even when focusing on a single highly specialized research field, most biologists have difficulty keeping up with its progress. Extracting knowledge accurately from this large body of literature has therefore become crucial. Biomedical text mining promises to achieve this goal and, in some cases, can also reduce costs by providing timely access to needed knowledge and uncovering explicit and implicit associations between pieces of knowledge.
Biomedical information extraction provides a content-oriented approach to processing biomedical documents, rather than simply ranking relevant documents by document similarity. Biomedical Named Entity Recognition (BioNER) is one of the basic tasks of biomedical text mining. It aims to recognize the text spans that refer to specific entities of interest, and it plays a key role in tasks such as disease-treatment relation extraction and gene function recognition. Named entity recognition in the general domain usually means recognizing the names of people, places, organizations, and so on in text; in the biomedical field, however, entities such as DNA, RNA, and proteins are of greater interest to biologists. BioNER is the first step in processing biomedical documents, and mistakes made at this step cascade into subsequent tasks such as relation recognition and event recognition. Given the important linguistic and semantic role that BioNER plays, recognizing and classifying biomedical entities more effectively has great theoretical significance and practical value for biomedical research.
Compared with named entities in the general domain, biomedical named entities (BioNEs) have the following characteristics: (1) BioNEs carry many pre-modifiers, as in major histocompatibility complex (MHC) class II genes (DNA), which makes entity length highly variable and entity boundaries difficult to determine. (2) BioNEs contain many conjunctions and disjunctions, i.e., two or more entity names share the same head noun (as a prefix or suffix) and are joined by a conjunction or disjunction. For example, the phrase "human T and natural killer cells" includes two named entities, human T cell (cell_type) and human natural killer cell (cell_type). (3) Entity nesting is widespread. For example, within the entity Duffy antigen/chemokine receptor gene (DNA), Duffy antigen/chemokine receptor is itself a protein that needs to be recognized. (4) BioNEs include many acronyms. These entities may be ambiguous, which makes it harder for a neural network model to obtain semantic information. For example, "TCF" may refer to either T-cell factor or tissue culture fluid. Such entities are also difficult to identify from existing dictionaries, and context is required to infer the entity type accurately. (5) There is no strict naming convention in the biomedical literature, so the same entity may have different representations. For example, cholesterol, 5-cholesten-3beta-ol and (3beta)-cholest-5-en-3-ol all denote the same chemical compound. Much work has applied existing general-domain named entity recognition methods directly to the biomedical field. However, because of the characteristics listed above, these methods rarely achieve satisfactory results, and biomedical named entity recognition (BioNER) remains a challenging problem. For this reason, the present invention studies BioNER-related methods.
The Named Entity Recognition (NER) task is generally treated as a sequence labeling problem, in which each word in a sentence is assigned a tag, Beginning of the entity (B), Inside of the entity (I), or Outside of any entity (O), that encodes its role. Over years of development, BioNER has passed through three major stages: dictionary-based methods, rule-based methods, and machine-learning-based methods.
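The B/I/O scheme described above can be illustrated with a minimal sketch. The sentence, the span offsets, and the function name `spans_to_bio` are illustrative assumptions, not part of the patent text.

```python
def spans_to_bio(tokens, spans):
    """Convert flat entity spans (start, end_exclusive, type) into B/I/O tags."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # remaining tokens of the entity
    return tags

# Hypothetical example sentence with two non-overlapping entities.
tokens = ["IL-2", "gene", "expression", "requires", "NF-kappa", "B"]
spans = [(0, 2, "DNA"), (4, 6, "protein")]
tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Because each token receives exactly one tag, this encoding cannot express two entities that share tokens, which is the limitation for nested entities discussed later in the background.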
The dictionary-based approach stores all known named entities in a database and uses it to perform simple exact (or fuzzy) matching against the text. However, given the rapid growth of the biomedical literature, it is not possible to build a dictionary that contains the entities of every category. The rule-based approach matches named entities through manually designed heuristic rules. Budi et al. use rules built from grammatical features (e.g., part of speech), syntactic features, and orthographic patterns (e.g., case) for named entity recognition. Fukuda et al. use rules based on case, symbols, digits, and so on to extract proteins. Etzioni et al. propose a semi-supervised framework that divides named entity recognition into three steps: pattern learning, subclass extraction, and list extraction; the framework automatically generates new extraction rules to complete the named entity recognition task. However, formulating such rules requires a great deal of manpower and material resources. The machine-learning-based approach has the advantage of automatically learning decision boundaries from annotated data, and it is widely used to solve the NER problem. Typically, NER is treated as a multi-class classification task or a sequence labeling task. Many supervised algorithms have been applied to NER, such as decision trees (DT), maximum entropy (ME), support vector machines (SVM), hidden Markov models (HMM), and conditional random fields (CRF). With machine-learning-based algorithms, researchers do not have to write complex rules by hand. In addition, these algorithms can identify new named entities and categories that do not appear in standard dictionaries.
In recent years, with the development of neural networks, Natural Language Processing (NLP) tasks have gained greater potential, and deep neural networks have been applied to many NLP tasks with great success. Compared with traditional machine learning methods based on manually constructed features, a neural network can automatically extract high-order abstract features from the raw input. It can also combine different layers (such as convolutional, recurrent, pooling, and fully connected layers) to implement complex nonlinear feature transformations. Many neural network models have been applied to the NER task, such as convolutional neural networks (CNN), long short-term memory networks (LSTM), LSTM-CNNs, and LSTM-CRF. On biomedical datasets (the JNLPBA corpus and the BioCreAtIvE II Gene Mention (GM) corpus), Gridach et al. combine deep neural networks with a CRF, word embeddings, and character-level word representations, and show good performance in BioNER. However, these methods have little ability to identify the nested entities that are widespread among BioNEs, which is a significant obstacle to improving BioNER performance.
There are fewer studies on recognizing named entities with nested structure than on recognizing flat entities. The earliest study on nested NEs is by Alex et al., who compared three classical nested named entity recognition methods: layering, cascading, and joint labeling. On the same dataset (the GENIA corpus), Finkel et al. use a flatter parse tree to identify nested NEs. In this model, rules are used to attach entity candidates to the parse tree, and a CRF model is then applied to the tree to output the label sequence. Chen et al. use a cascading framework to identify nested named entities, which can be divided into three steps: boundary detection, boundary combination, and entity screening. In this model, a CRF model is used to detect entity boundaries; once the candidate entity set has been built, a maximum entropy model is used to find the positive entity instances. Lu et al. devise a mention hypergraph method to identify nested named entities. The hypergraph is a compact representation of all possible combinations of entities, and on this representation each sub-hypergraph is labeled using a log-linear method to identify nested NEs. Because this model requires a large number of manually defined features, Muis et al. build on it and propose a neural-network-based hypergraph model for nested NE recognition. Ju et al. identify nested entities by generating a flat NER layer from the output of the previous LSTM layer; the model dynamically stacks flat NER layers until no outer entities are extracted. Even though BioNER has been studied extensively, there is still much room for improvement in its performance.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: discontinuous entities in biomedical text are first modeled as a nested structure; a boundary detection classifier is built with a neural network model to identify the start and end boundaries of entities; a candidate entity set is then generated through a boundary combination strategy; finally, a classifier is trained to screen the candidate named entities, thereby effectively addressing the poor recognition performance for biomedical entities.
The technical scheme of the invention is as follows: a biomedical named entity recognition method based on depth boundary combination, comprising the following steps: step one, modeling discontinuous entities among biomedical entities as a nested entity structure; step two, representing biomedical vocabulary information using character-level embedding and word-level embedding; step three, recognizing the boundaries of biomedical entities using a neural network model based on the word vectors obtained in step two; step four, generating a candidate entity set using a boundary combination strategy; and step five, constructing a neural network classifier to screen the candidate entity set.
In the third step, the neural network model is a Bi-LSTM + CRF model.
In the fourth step, the boundary combination strategy is a greedy matching strategy.
In the fifth step, the sentence is divided into four parts centered on the candidate entity: the part to the left of the entity, the entity in forward order, the entity in reverse order, and the part to the right of the entity. These parts are fed into the neural network as four channels; a convolutional neural network model mines latent local semantic information, and a fully connected layer then captures sentence-level global information to complete named entity recognition.
The beneficial effects of the invention are as follows: compared with the prior art, the technical scheme targets the characteristics of biomedical named entities, adopts a depth boundary combination framework, and draws on available external resources to represent biomedical words more accurately; it addresses the recognition of discontinuous entities in biomedical text and completes the BioNER task. It provides stronger theoretical and technical support for BioNER, offers researchers in the biomedical field a convenient and efficient entity recognition tool, and effectively improves the performance of biomedical entity recognition.
The depth boundary combination framework has the following main advantages: (1) Entity boundaries are fine-grained and do not depend on any other NLP task. Boundary information is unambiguous and easier to identify than whole NEs. (2) The framework is highly flexible. It is a cascade framework, so different models can be used for boundary detection, boundary combination, and entity screening. (3) External resources can be used effectively. Pre-trained word embeddings can be obtained from large-scale raw data, which helps the neural network model capture semantic information, so experimental performance can be improved by using external resources.
Drawings
FIG. 1 is an example diagram of nested entities and discontinuous entities according to the present invention;
FIG. 2 is a diagram of a rule-optimized boundary detection model according to the present invention;
FIG. 3 is a diagram of a depth boundary combinatorial model according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings.
Example 1: as shown in FIG. 1 to FIG. 3, a biomedical named entity recognition method based on depth boundary combination comprises the following steps: step one, modeling discontinuous entities among biomedical entities as a nested entity structure; step two, representing biomedical vocabulary information using character-level embedding and word-level embedding; step three, recognizing the boundaries of biomedical entities using a neural network model based on the word vectors obtained in step two; step four, generating a candidate entity set using a boundary combination strategy; and step five, constructing a neural network classifier to screen the candidate entity set.
Step one models the representation of discontinuous entities. An example of a discontinuous entity is shown in FIG. 1. Discontinuous entities in biomedical text are difficult to represent, and almost all related studies skip their recognition because they are hard to model. The present invention converts discontinuous entities into a nested structure. For example, the phrase "HEL, KU812 and K562 cells" contains three BioNEs: "HEL cells", "KU812 cells", and "K562 cells". Using this representation, the example can be converted into three nested named entities: "HEL, KU812 and K562 cells", "KU812 and K562 cells", and "K562 cells".
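The conversion in the "HEL, KU812 and K562 cells" example can be sketched as follows. The rule used here, mapping each discontinuous entity to the contiguous span that runs from its first token to its last token, is an assumption inferred from that example; the function name `to_nested_span` and the tokenization are hypothetical.

```python
def to_nested_span(token_indices):
    """Map a (possibly discontinuous) entity to its covering span (inclusive indices)."""
    return (min(token_indices), max(token_indices))

tokens = ["HEL", ",", "KU812", "and", "K562", "cells"]

# Each entity is given by the indices of the tokens it is made of.
entities = {
    "HEL cells": [0, 5],      # discontinuous: "HEL" ... "cells"
    "KU812 cells": [2, 5],    # discontinuous: "KU812" ... "cells"
    "K562 cells": [4, 5],     # already contiguous
}

nested = {name: to_nested_span(idx) for name, idx in entities.items()}
surface = {name: " ".join(tokens[s:e + 1]) for name, (s, e) in nested.items()}
for name, text in surface.items():
    print(f"{name!r} -> {text!r}")
```

All three resulting spans share the same right boundary ("cells") and differ only in their left boundary, which is what makes them a nested structure that a boundary-based method can recover.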
In step two, in view of the characteristics of biomedical vocabulary, more accurate word vectors are used to express its semantic and syntactic information so that biomedical text mining can be carried out effectively. The invention concatenates a character-level embedding vector with a word-level embedding vector to better represent the semantics of biomedical vocabulary. The character-level embedding vector is produced by a Bi-LSTM trained over the characters of each word, and the word-level embedding vector is a GloVe vector trained by Stanford University on a corpus of 6 billion words.
In the third step, the neural network model is a Bi-LSTM + CRF model, constructed to identify the boundaries of biomedical entities in a sentence. Based on the characteristics of entity boundaries, a general, accurate, and unified representation of entity boundaries is sought, and rules reflecting the characteristics of entities in the biomedical field are added to the neural network model to optimize boundary detection performance. In this way, entity boundary information is retained as fully as possible when the raw corpus is converted into high-level features, and boundary semantic information is extracted efficiently and used in full.
In step four, the boundary combination strategy is a greedy matching strategy. Building on entity boundary recognition, a boundary assembly strategy converts an entity structure containing multiple nested layers into flat entity structures that are independent of one another, so that the nested or discontinuous entities contained in a sentence are represented accurately. The uniformly represented entity boundary information is combined in a suitable way to generate candidate entities, so as to find the nested and discontinuous entities contained in the sentence.
In step five, on the basis of the combined boundary information, models such as convolutional neural networks and LSTM are used to screen out the correct entities, with precision (P), recall (R), and F1 score as the performance metrics. The sentence is divided into four parts centered on the candidate entity: the part to the left of the entity, the entity in forward order, the entity in reverse order, and the part to the right of the entity. These parts are fed into the neural network as four channels; a convolutional neural network model mines latent local semantic information, and a fully connected layer then captures sentence-level global information to complete named entity recognition.
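The three metrics named above can be computed as in the following minimal sketch. Exact matching of (start, end, type) triples is an assumption, since the text does not state the matching criterion; the example spans and the type name "cell_line" are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Entity-level P, R and F1 under exact span-and-type matching (an assumption)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    p = true_positives / len(predicted) if predicted else 0.0
    r = true_positives / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Hypothetical gold and predicted entities as (start, end, type) triples.
gold = {(0, 5, "cell_line"), (2, 5, "cell_line"), (4, 5, "cell_line")}
pred = {(0, 5, "cell_line"), (4, 5, "cell_line"), (2, 3, "cell_line")}

p, r, f1 = precision_recall_f1(pred, gold)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Representing entities as sets of triples means nested entities are scored like any others, since overlapping spans are simply distinct elements of the set.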
The present invention will be further described with reference to the following examples:
To carry out the method of the invention, step one is performed first: the discontinuous entities present among biomedical entities are modeled as nested structures. Discontinuous entities in biomedical text are difficult to represent, and almost all related studies skip their recognition because they are hard to model. The present invention converts discontinuous entities into a nested structure. For example, the phrase "HEL, KU812 and K562 cells" contains three BioNEs: "HEL cells", "KU812 cells", and "K562 cells". Using this representation, the example can be converted into three nested named entities: "HEL, KU812 and K562 cells", "KU812 and K562 cells", and "K562 cells".
Next, step two is executed to obtain the semantic information of the biomedical vocabulary. The invention concatenates a character-level embedding vector with a word-level embedding vector to represent each biomedical word. The word-level embedding vector is produced by a lookup table, which may be initialized randomly or with pre-trained values. In the present invention, it is initialized with the GloVe word vectors trained by Stanford University on a corpus of 6 billion words. The character-level embedding vector is trained with a Bi-LSTM model: each letter of a word (with a fixed length of 20 letters per word) is mapped to a 30-dimensional random vector, the sequence is fed through the Bi-LSTM model, and the output of the model is taken as the character-level vector representation of the word. Finally, the character-level embedding vector and the word-level embedding vector are concatenated to form the final word vector representation of the word.
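The data flow of this step (fixed 20-character words, 30-dimensional character vectors, concatenation with a word-level vector) can be sketched in plain Python. This is a shape-level illustration only: a simple average of the character vectors stands in for the Bi-LSTM encoder, random values stand in for the pre-trained GloVe vectors, and the word-vector size `WORD_DIM = 100` is an assumption, since the text does not state it.

```python
import random

MAX_CHARS = 20   # fixed word length stated in the text
CHAR_DIM = 30    # per-character vector size stated in the text
WORD_DIM = 100   # assumption: the text does not give the word-vector size

rng = random.Random(0)
char_table = {}  # character -> random 30-dimensional vector (lookup table)
word_table = {}  # word -> vector; random stand-in for pre-trained GloVe values

def lookup(table, key, dim):
    """Return the vector for key, creating a random one on first use."""
    if key not in table:
        table[key] = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    return table[key]

def char_level_vector(word):
    """Pad or truncate to MAX_CHARS characters, then pool the character vectors."""
    chars = list(word[:MAX_CHARS]) + ["<pad>"] * (MAX_CHARS - len(word))
    vectors = [lookup(char_table, c, CHAR_DIM) for c in chars]
    # Stand-in for the Bi-LSTM encoder: average the character vectors.
    return [sum(column) / MAX_CHARS for column in zip(*vectors)]

def word_representation(word):
    """Concatenate the character-level and word-level vectors."""
    return char_level_vector(word) + lookup(word_table, word.lower(), WORD_DIM)

vec = word_representation("KU812")
print(len(vec))
```

In a real system the averaging step would be replaced by a trained bidirectional LSTM, whose output size then determines the character-level part of the concatenated vector.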
After the vector representation of the biomedical vocabulary has been obtained, step three is executed: a Bi-LSTM + CRF + rule model is constructed to detect entity boundaries. The model framework is shown in FIG. 3 (boundary classifier). The boundary detection model used in the invention is a classical Bi-LSTM + CRF structure. The vector representation of the biomedical vocabulary obtained in step two is fed into the neural network model, followed by a fully connected layer and a CRF layer, which outputs the label sequence with the highest probability. In addition, the invention introduces rules from the biomedical field into the boundary detection model in two ways. In the first way, after the model produces its output, the possible entity boundaries are screened using a series of rules (e.g., words containing three or more consecutive capital letters, words containing a connecting symbol such as "-" or "/", and words containing an affix such as "DNA" or "RNA"). In the second way, a series of biomedical rules is mapped into a lookup table, a rule vector is generated for each word through the lookup table, and the rule vector is concatenated with the word vector to form a higher-dimensional word vector, which is fed into the model to complete entity boundary detection.
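The three example rules can be turned into a per-word rule vector as in the sketch below. The binary encoding, the reading of the first rule as "three or more consecutive capital letters", and the function name `rule_features` are assumptions; the text says only that rules are mapped through a lookup table into a rule vector.

```python
import re

AFFIXES = ("DNA", "RNA")  # the two affixes the text gives as examples

def rule_features(word):
    """Return a binary vector with one entry per rule (illustrative encoding)."""
    return [
        # Rule 1: three or more consecutive capital letters.
        int(re.search(r"[A-Z]{3,}", word) is not None),
        # Rule 2: contains a connecting symbol such as "-" or "/".
        int(any(symbol in word for symbol in "-/")),
        # Rule 3: starts or ends with a known affix.
        int(word.startswith(AFFIXES) or word.endswith(AFFIXES)),
    ]

for word in ["HEL", "NF-kappaB", "mRNA", "cells"]:
    print(word, rule_features(word))
```

A vector like this could be concatenated with the word vector (the second way described above) or used after decoding to filter boundary predictions (the first way).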
Next, step four is executed: a candidate entity set is generated using a boundary combination strategy. The invention uses a greedy matching strategy: each ending boundary is matched with the top n (n = 1, 2, 3, ...) possible starting boundaries in the range between that ending boundary and the ending boundary to its left. Through this strategy, the possible flat entities, nested entities, and discontinuous entities (modeled as nested structures) in the sentence are found, and a candidate entity set is generated.
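One possible reading of the greedy matching strategy is sketched below: each ending boundary is paired with the n nearest starting boundaries at or to the left of it. The source wording about the search range is ambiguous, so this is an assumed interpretation rather than the patented procedure; the function name `combine_boundaries` and the boundary positions are hypothetical.

```python
def combine_boundaries(starts, ends, n=3):
    """Pair each end boundary with its n nearest start boundaries to the left."""
    candidates = []
    for end in sorted(ends):
        # Start boundaries at or before this end, nearest first.
        left_starts = sorted((s for s in starts if s <= end), reverse=True)
        candidates.extend((start, end) for start in left_starts[:n])
    return candidates

tokens = ["HEL", ",", "KU812", "and", "K562", "cells"]
starts, ends = [0, 2, 4], [5]   # assumed output of the boundary detector

candidates = combine_boundaries(starts, ends, n=3)
for start, end in candidates:
    print((start, end), " ".join(tokens[start:end + 1]))
```

With n = 3 the three nested spans of the running example are all proposed; with n = 1 only the innermost span "K562 cells" is proposed, so n trades candidate-set size against coverage of deeper nesting.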
Next, step five is executed: an entity classifier is built with a neural network model to retain the correct entities in the candidate entity set and filter out the incorrect ones. Many models can be used for this step, such as convolutional neural networks (CNN), long short-term memory networks (LSTM), conditional random fields (CRF), maximum entropy (ME) models, and support vector machines (SVM); the present invention uses a convolutional neural network (CNN) model. The input to this step is a sentence containing labeled candidate entities, each candidate carrying a label that indicates whether it is a correct entity. Thus, the input may be represented as a set:
Figure BDA0002337420230000071
where
Figure BDA0002337420230000072
denotes the candidate entity in sentence S_k that spans from the i-th position to its end position, and whose label is L_k. In brief, this step takes a sentence containing a labeled candidate entity as input and trains a classifier to decide whether the current candidate is a correct entity. The specific procedure is as follows. The sentence is divided into four channels, with the entity as the dividing line: the part to the left of the entity, the entity in forward order, the entity in reverse order, and the part to the right of the entity. The length of each channel is fixed at 80. Each channel is processed by a neural network consisting of an embedding layer, a convolutional layer, and a max pooling layer. In the embedding layer, the BERT model maps the tokens of each channel to 768-dimensional word vectors. The convolutional layer and the max pooling layer then produce vectors representing high-order abstract features, which are passed to the fully connected layer; finally, a softmax activation function outputs the vector representing the predicted category.
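The four-channel split with a fixed channel length of 80 can be sketched as follows. The padding token, the truncation policy (keep the first 80 tokens of a channel), and the function names `fit` and `four_channels` are assumptions; the embedding, convolution, and pooling layers are not shown.

```python
CHANNEL_LEN = 80   # fixed channel length stated in the text
PAD = "<pad>"      # hypothetical padding token

def fit(tokens, length=CHANNEL_LEN):
    """Truncate or right-pad a token list to the fixed channel length (assumed policy)."""
    return tokens[:length] + [PAD] * (length - len(tokens))

def four_channels(tokens, start, end):
    """Split a sentence around the candidate entity tokens[start..end] (inclusive)."""
    entity = tokens[start:end + 1]
    return {
        "left": fit(tokens[:start]),
        "entity_forward": fit(entity),
        "entity_reverse": fit(entity[::-1]),
        "right": fit(tokens[end + 1:]),
    }

tokens = "The HEL , KU812 and K562 cells were cultured".split()
channels = four_channels(tokens, start=3, end=6)   # candidate "KU812 and K562 cells"
for name, channel in channels.items():
    print(name, [t for t in channel if t != PAD])
```

Each of the four fixed-length token lists would then be embedded and passed through its own convolution and max pooling layers before the features are joined in the fully connected layer.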
Finally, the invention verifies its effectiveness on a real dataset, the GENIA corpus. The GENIA corpus was built by the GENIA project for developing and evaluating information retrieval and text mining systems for molecular biology. It is drawn from the biomedical literature and consists of 2000 MEDLINE abstracts retrieved from PubMed using three medical subject headings: human, blood cells, and transcription factors. The dataset contains 36 fine-grained entity classes and 94584 entities in total, of which 35.27% are nested or discontinuous. Table 1 shows the performance of the depth boundary combination method in identifying entities in the GENIA dataset. The Layering method evaluates the innermost and outermost layers separately and then considers the results of the two layers together; it can therefore identify two layers of nested entities, but it cannot capture the semantic information provided by different categories. The Cascading method identifies one entity category at a time with an LSTM sequence model: 10 mutually independent models are built, and the overall performance is obtained from the 10 sets of results. This method clearly cannot take the relations among categories into account and, to some extent, cannot identify multi-layer nested entities.
table 1: performance of various entities on the GENIA dataset
Figure BDA0002337420230000081
Figure BDA0002337420230000091
To compare the present invention with related work, the experimental setup follows that of Lu et al.; Table 2 compares the present invention with related work.
Table 2: comparison of Experimental Properties
Figure BDA0002337420230000092
As can be seen from Tables 1 and 2, the present invention effectively models the representation of discontinuous entities and can accurately identify the discontinuous entities present in the biomedical literature. In addition, the method overcomes the shortcomings of traditional sequence labeling methods and identifies nested entities more effectively.
The depth boundary combination framework has the following main advantages: (1) Entity boundaries are fine-grained and do not depend on any other NLP task. Boundary information is unambiguous and easier to identify than whole NEs. (2) The framework is highly flexible. It is a cascade framework, so different models can be used for boundary detection, boundary combination, and entity screening. (3) External resources can be used effectively. Pre-trained word embeddings can be obtained from large-scale raw data, which helps the neural network model capture semantic information, so experimental performance can be improved by using external resources.
Matters not described in detail in the present invention are known to those skilled in the art. Finally, the above embodiments serve only to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from their spirit and scope, and all such modifications and substitutions shall be covered by the claims of the present invention.

Claims (4)

1. A biomedical named entity recognition method based on depth boundary combination, characterized in that the method comprises the following steps: step one, modeling discontinuous entities among the biomedical entities as a nested entity structure; step two, representing biomedical vocabulary information using character-level embeddings and word-level embeddings; step three, identifying the boundaries of biomedical entities using a neural network model based on the word vectors obtained in step two; step four, generating a candidate entity set using a boundary combination strategy; and step five, constructing a neural network classifier and screening the candidate entity set.
2. The biomedical named entity recognition method based on depth boundary combination according to claim 1, characterized in that: in step three, the neural network model is a Bi-LSTM + CRF model.
3. The biomedical named entity recognition method based on depth boundary combination according to claim 1, characterized in that: in step four, the boundary combination strategy is a greedy matching strategy.
4. The biomedical named entity recognition method based on depth boundary combination according to claim 1, characterized in that: in step five, with the candidate entity as the center, the sentence is divided into four parts: the left context of the entity, the entity in forward order, the entity in reverse order, and the right context of the entity; the four parts are fed into the neural network through four channels, a convolutional neural network model is used to further mine latent local semantic information, and a fully connected layer is then attached to obtain sentence-level global information and complete the recognition of the named entity.
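The four-part division recited in claim 4 can be illustrated with a short Python sketch. It covers only the input preparation, not the convolutional network or the fully connected layer. The function name `split_four_channels`, the inclusive span convention, and the reading of the two entity sequences as the entity tokens in forward and in reversed order are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def split_four_channels(tokens: Sequence[str], start: int, end: int
                        ) -> Tuple[List[str], List[str], List[str], List[str]]:
    """Split a sentence around a candidate entity span (inclusive indices).

    Returns (left context, entity forward, entity reversed, right context).
    """
    if not 0 <= start <= end < len(tokens):
        raise ValueError("candidate span is outside the sentence")
    left = list(tokens[:start])
    entity = list(tokens[start:end + 1])
    entity_reversed = entity[::-1]  # assumed meaning of the second entity channel
    right = list(tokens[end + 1:])
    return left, entity, entity_reversed, right


sentence = ["the", "IL-2", "receptor", "binds", "ligand"]
left, ent, ent_rev, right = split_four_channels(sentence, 1, 2)
print(left)     # ['the']
print(ent)      # ['IL-2', 'receptor']
print(ent_rev)  # ['receptor', 'IL-2']
print(right)    # ['binds', 'ligand']
```

Each of the four token lists would then be embedded and passed to its own input channel of the classifier, so the screening decision sees both the candidate itself and the context on either side.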
CN201911362019.XA 2019-12-26 2019-12-26 Biomedical named entity recognition method based on depth boundary combination Active CN111126040B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911362019.XA CN111126040B (en) 2019-12-26 2019-12-26 Biomedical named entity recognition method based on depth boundary combination

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911362019.XA CN111126040B (en) 2019-12-26 2019-12-26 Biomedical named entity recognition method based on depth boundary combination

Publications (2)

Publication Number Publication Date
CN111126040A (en) 2020-05-08
CN111126040B (en) 2023-06-20

Family

ID=70502739

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911362019.XA Active CN111126040B (en) 2019-12-26 2019-12-26 Biomedical named entity recognition method based on depth boundary combination

Country Status (1)

Country Link
CN (1) CN111126040B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8626700B1 (en) * 2010-04-30 2014-01-07 The Intellisis Corporation Context aware device execution for simulating neural networks in compute unified device architecture
WO2018136308A1 (en) * 2017-01-18 2018-07-26 Microsoft Technology Licensing, Llc Organization of signal segments supporting sensed features
CN107644014A (en) * 2017-09-25 2018-01-30 南京安链数据科技有限公司 Named entity recognition method based on bidirectional LSTM and CRF
CN108229582A (en) * 2018-02-01 2018-06-29 浙江大学 Multi-task named entity recognition adversarial training method for the medical domain
CN108628970A (en) * 2018-04-17 2018-10-09 大连理工大学 Biomedical event joint extraction method based on a new tagging scheme
US20190354582A1 (en) * 2018-05-21 2019-11-21 LEVERTON GmbH Post-filtering of named entities with machine learning
CN110032737A (en) * 2019-04-10 2019-07-19 贵州大学 Neural network-based boundary combination named entity recognition method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MOURAD GRIDACH: "Character-level neural network for biomedical named entity recognition", 《HTTPS://WWW.SCIENCEDIRECT.COM/SCIENCE/ARTICLE/PII/S1532046417300977》 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807094A (en) * 2020-06-11 2021-12-17 株式会社理光 Entity identification method, device and computer readable storage medium
CN113807094B (en) * 2020-06-11 2024-03-19 株式会社理光 Entity recognition method, entity recognition device and computer readable storage medium
CN112257446A (en) * 2020-10-20 2021-01-22 平安科技(深圳)有限公司 Named entity recognition method and device, computer equipment and readable storage medium
WO2021179708A1 (en) * 2020-10-20 2021-09-16 平安科技(深圳)有限公司 Named-entity recognition method and apparatus, computer device and readable storage medium
CN112487812A (en) * 2020-10-21 2021-03-12 上海旻浦科技有限公司 Nested entity identification method and system based on boundary identification
CN112487812B (en) * 2020-10-21 2021-07-06 上海旻浦科技有限公司 Nested entity identification method and system based on boundary identification
CN113033207A (en) * 2021-04-07 2021-06-25 东北大学 Biomedical nested type entity identification method based on layer-by-layer perception mechanism
CN113033207B (en) * 2021-04-07 2023-08-29 东北大学 Biomedical nested type entity identification method based on layer-by-layer perception mechanism
CN112989835A (en) * 2021-04-21 2021-06-18 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Extraction method of complex medical entities
CN113569573A (en) * 2021-06-28 2021-10-29 浙江工业大学 Method and system for identifying generalization entity facing financial field

Also Published As

Publication number Publication date
CN111126040B (en) 2023-06-20

Similar Documents

Publication Publication Date Title
CN111126040B (en) Biomedical named entity recognition method based on depth boundary combination
CN109446338B (en) Neural network-based drug disease relation classification method
CN106776711B (en) Chinese medical knowledge map construction method based on deep learning
CN111737496A (en) Power equipment fault knowledge map construction method
CN110364234B (en) Intelligent storage, analysis and retrieval system and method for electronic medical records
CN111680173A (en) CMR model for uniformly retrieving cross-media information
CN111554360A (en) Drug relocation prediction method based on biomedical literature and domain knowledge data
CN111222318B (en) Trigger word recognition method based on double-channel bidirectional LSTM-CRF network
Yadav et al. Feature selection for entity extraction from multiple biomedical corpora: A PSO-based approach
Wan et al. A self-attention based neural architecture for Chinese medical named entity recognition
CN111914556B (en) Emotion guiding method and system based on emotion semantic transfer pattern
CN115048447B (en) Database natural language interface system based on intelligent semantic completion
WO2021190662A1 (en) Medical text sorting method and apparatus, electronic device, and storage medium
CN112328800A (en) System and method for automatically generating programming specification question answers
CN111950283A (en) Chinese word segmentation and named entity recognition system for large-scale medical text mining
Schäfer et al. UMLS mapping and Word embeddings for ICD code assignment using the MIMIC-III intensive care database
Popchev et al. Text Mining in the Domain of Plant Genetic Resources
CN112597285A (en) Man-machine interaction method and system based on knowledge graph
CN117371523A (en) Education knowledge graph construction method and system based on man-machine hybrid enhancement
Xing et al. Phenotype extraction based on word embedding to sentence embedding cascaded approach
CN116227594A (en) Construction method of high-credibility knowledge graph of medical industry facing multi-source data
CN113468311B (en) Knowledge graph-based complex question and answer method, device and storage medium
Ronghui et al. Application of Improved Convolutional Neural Network in Text Classification.
Wang et al. Bi-directional joint embedding of encyclopedic knowledge and original text for chinese medical named entity recognition
Song et al. Biomedical named entity recognition based on recurrent neural networks with different extended methods

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant