CN110807327A - Biomedical entity identification method based on contextualized capsule network - Google Patents

Publication number
CN110807327A
Authority
CN
China
Prior art keywords
biomedical
contextualized
capsule
entity
capsule network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910982694.6A
Other languages
Chinese (zh)
Other versions
CN110807327B (en)
Inventor
陈鹏
徐博
夏锋
王悦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-10-16
Filing date
2019-10-16
Publication date
2020-02-18
Application filed by Dalian University of Technology
Priority to CN201910982694.6A
Publication of CN110807327A
Application granted
Publication of CN110807327B
Current legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/10 Ontologies; Annotations

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the technical field of computer natural language processing and provides a biomedical entity identification method based on a contextualized capsule network, comprising the following steps: S1, obtaining a biomedical corpus; S2, performing data preprocessing on the acquired text; S3, constructing a biomedical entity recognition model of the contextualized capsule network and training it on a training set; S4, performing named entity recognition on unseen biomedical text with the trained contextualized capsule network model; and S5, post-processing, in which all illegal labels in the predictions of the contextualized capsule network model are set to "O", further improving entity recognition performance. The method automatically recognizes named entities in biomedical documents; compared with manual recognition it offers higher recognition accuracy and lower time overhead, and it has stronger generalization capability.

Description

Biomedical entity identification method based on contextualized capsule network
Technical Field
The invention belongs to the technical field of computer natural language processing, and particularly relates to a biomedical entity identification method based on a contextualized capsule network.
Background Art
Named entity recognition is the first step of information extraction. The task is to identify entities with a specific meaning in documents, such as proper nouns like the names of people, places and organizations. In the biomedical field, biomedical entity recognition refers to the automatic recognition of entities such as genes, proteins, diseases and chemicals, in order to help biomedical experts extract valuable information from the vast biomedical literature. As a core task of biomedical information extraction, biomedical named entity recognition has received widespread attention from researchers. At present, the prevailing approaches to this task are methods based on statistical machine learning and methods based on deep learning. Statistical machine learning methods rely heavily on hand-crafted features, which are time-consuming and costly to design. Their predictive performance also depends on the size of the corpus, which is a challenge because annotated corpora for biomedical named entity recognition are scarce. Deep learning methods achieve state-of-the-art performance, but they remain limited in encoding richer sequence structure information, for example abbreviations, ambiguous words, or mixtures of words, punctuation and numbers. The invention provides a biomedical entity identification method based on a contextualized capsule network, which obtains advanced experimental results on a biomedical dataset.
Disclosure of Invention
The invention aims to overcome the deficiencies of the prior art. By exploiting the ability of capsule networks to better model important spatial hierarchies among complex data objects, it provides a biomedical entity identification method based on a contextualized capsule network and addresses problems such as the difficulty of manual feature extraction and poor recognition performance.
The technical solution of the invention is as follows:
A biomedical entity identification method based on a contextualized capsule network comprises the following steps:
S1, obtaining a biomedical corpus;
S2, performing data preprocessing on the acquired text;
S3, constructing a biomedical entity recognition model of the contextualized capsule network, wherein the model consists of three layers, namely a feature representation layer, a main capsule layer and an entity capsule layer:
S3.1, feature representation layer: formed by concatenating a word processor, a contextualization processor and an auxiliary processor; when entity recognition is performed on a sentence of length N, a sliding window of size W and stride 1 is first used to construct the feature representation of the text sequentially, and the concatenation of all features within the window serves as the feature input of the current word;
the word processor is obtained by concatenating a pre-trained word vector with word case information; the contextualization processor is a contextualized representation obtained with ELMo trained on a large unlabeled biomedical corpus; the auxiliary processor is a one-hot dictionary vector obtained from dictionary features; finally, the word processor, the contextualization processor and the auxiliary processor are concatenated to represent the features of the current word in a specific semantic space;
S3.2, main capsule layer: consists of a two-layer stacked Bi-LSTM encoder; the Bi-LSTM encoder extracts the contextual features at time step i, denoted u_i;
S3.3, entity capsule layer: the main capsule layer is routed to the higher-level entity capsule layer using dynamic routing with shared weights, as follows:
(1) the main capsule u_i is transformed by the shared weight matrix w_j to obtain the voting vector u_{j|i}; each voting vector is then assigned a weight coefficient c_{ij} by a softmax function, and the weighted sum S_j of the u_{j|i} is computed for each named entity class j;
(2) finally, the nonlinear squash function is applied to S_j to obtain the named entity class capsule v_j for the next routing iteration;
(3) the dynamic routing process is expressed by the following formulas:
u_{j|i} = w_j u_i    (1)
S_j = \sum_i c_{ij} u_{j|i}    (2)
v_j = \mathrm{squash}(S_j)    (3)
S4, training the biomedical entity recognition model of the contextualized capsule network on the training set, and performing named entity recognition on unseen biomedical text;
the biomedical entity recognition model of the contextualized capsule network is trained with the following loss function:
L_j = E_j \max(0, m^+ - \|v_j\|)^2 + \lambda (1 - E_j) \max(0, \|v_j\| - m^-)^2    (4)
where E_j = 1 when entity class j is present and E_j = 0 otherwise, and m^+, m^- and \lambda are hyper-parameters;
S5, post-processing: on the basis of the predictions of the biomedical entity recognition model of the contextualized capsule network, all illegal labels are set to "O", which further improves entity recognition performance.
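As an illustration of the loss in formula (4), the following is a minimal sketch in plain Python. The function name `margin_loss`, the summation over entity classes, and the default values of `m_plus`, `m_minus` and `lam` are assumptions made for this example (the defaults are values commonly used with capsule networks); the text only states that m^+, m^- and \lambda are hyper-parameters.

```python
import math


def margin_loss(v, present, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Sum of L_j from formula (4) over all entity classes j.

    v       -- list of entity capsule vectors v_j (each a list of floats)
    present -- list of indicators E_j (1 if class j is present, else 0)
    m_plus, m_minus, lam -- hyper-parameters; the defaults are assumed values
    """
    total = 0.0
    for v_j, e_j in zip(v, present):
        norm = math.sqrt(sum(x * x for x in v_j))  # ||v_j||
        # Penalty when a present class has a capsule shorter than m_plus.
        pos = max(0.0, m_plus - norm) ** 2
        # Penalty when an absent class has a capsule longer than m_minus.
        neg = max(0.0, norm - m_minus) ** 2
        total += e_j * pos + lam * (1 - e_j) * neg
    return total
```

With these assumed defaults, a present class whose capsule length is at least 0.9 and an absent class whose capsule length is at most 0.1 contribute no loss.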
The beneficial effects of the invention are as follows:
(1) The invention provides a contextualized capsule network model that considers sequence information and spatial patterns simultaneously and performs competitively on complex text data, in particular entities composed of abbreviations, ambiguous words, or mixtures of words, punctuation and numbers.
(2) Unlike current named entity recognition methods based on sentence-level sequence labeling, the contextualized capsule network approach casts the task as a word-level classification problem conditioned on context.
(3) Compared with current state-of-the-art methods, the proposed contextualized capsule network model achieves competitive results in disease and chemical recognition on the BC5CDR dataset. The experiments demonstrate the advantages of the contextualized capsule network for biomedical entity recognition.
(4) The proposed biomedical entity identification method based on a contextualized capsule network automatically recognizes named entities in biomedical documents; compared with manual recognition it offers higher recognition accuracy and lower time overhead, and it has stronger generalization capability.
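The word-level formulation rests on the sliding-window features of step S3.1, which can be sketched as follows. This is only an illustration: the function name `window_features`, the requirement that W be odd, and the zero padding at sentence boundaries are assumptions of this example, and the target word is taken to sit at the center of the window.

```python
def window_features(token_vectors, window_size, pad_value=0.0):
    """Build one concatenated feature vector per word.

    token_vectors -- per-word feature vectors of a sentence of length N
    window_size   -- window size W (assumed odd so the target word is centered)
    pad_value     -- filler for positions outside the sentence (assumption)
    """
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd in this sketch")
    if not token_vectors:
        return []
    half = window_size // 2
    pad = [pad_value] * len(token_vectors[0])
    padded = [pad] * half + list(token_vectors) + [pad] * half
    features = []
    for i in range(len(token_vectors)):  # stride 1: one window per word
        window = padded[i:i + window_size]
        # Concatenate every vector inside the window into a single input.
        features.append([x for vec in window for x in vec])
    return features
```

Each of the N words thus receives its own input of length W times the per-word feature size, so every word can be classified independently of the other windows.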
Drawings
FIG. 1 is a flow chart of biomedical entity identification with the contextualized capsule network of the invention.
FIG. 2 is a structural diagram of the contextualized capsule network model of the invention.
Detailed Description
The technical solution of the invention is described in further detail below with reference to a specific embodiment, so that those skilled in the art can better understand and implement the invention; the embodiment is not intended to limit the invention.
A biomedical entity identification method based on a contextualized capsule network, whose flow chart is shown in FIG. 1, specifically comprises the following steps:
S1, obtaining a biomedical corpus;
S2, performing data preprocessing on the acquired text;
the preprocessing comprises word segmentation and number replacement. Specifically: the text is split uniformly at spaces and at the characters in the set "/--><;:?[]{}()!@#$%^&*-+", which serve as segmentation points; and every number in the text (integer or floating point) is replaced with the uniform token "num".
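The preprocessing of step S2 can be sketched with the Python standard library as below. The function name `preprocess` is hypothetical, and keeping each delimiter character as its own token (rather than discarding it) is an assumption of this example, since the text only says that these characters serve as segmentation points.

```python
import re

# Character set quoted in the text above.
DELIMS = "/--><;:?[]{}()!@#$%^&*-+"
# Split at any delimiter character (captured, so it is kept) or at whitespace.
_SPLIT = re.compile("([" + re.escape(DELIMS) + r"])|\s+")
# Integer or floating-point number.
_NUMBER = re.compile(r"\d+(\.\d+)?")


def preprocess(text):
    """Segment text into tokens and replace numeric tokens with "num"."""
    tokens = [t for t in _SPLIT.split(text) if t]  # drop None and empty parts
    return ["num" if _NUMBER.fullmatch(t) else t for t in tokens]
```

For example, `preprocess("Aspirin (100 mg) reduced risk by 3.5%")` yields the tokens `Aspirin`, `(`, `num`, `mg`, `)`, `reduced`, `risk`, `by`, `num`, `%`.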
S3, constructing a biomedical entity recognition model of the contextualized capsule network, wherein the model consists of three layers, namely a feature representation layer, a main capsule layer and an entity capsule layer; the specific structure is shown in FIG. 2;
S3.1, feature representation layer: formed by concatenating a word processor, a contextualization processor and an auxiliary processor; when entity recognition is performed on a sentence of length N, a sliding window of size W and stride 1, with the target word at the center of the window, is first used to construct the feature representation of the text sequentially, and the concatenation of all features within the window serves as the feature input of the current word;
S3.2, main capsule layer: consists of a two-layer stacked Bi-LSTM encoder; the Bi-LSTM encoder extracts the contextual features at time step i, denoted u_i;
S3.3, entity capsule layer: the main capsule layer is routed to the higher-level entity capsule layer using dynamic routing with shared weights, as follows:
the main capsule u_i is transformed by the shared weight matrix w_j to obtain the voting vector u_{j|i}; each voting vector is then assigned a weight coefficient c_{ij} by a softmax function, and the weighted sum S_j of the u_{j|i} is computed for each named entity class j; finally, the nonlinear squash function is applied to S_j to obtain the named entity class capsule v_j for the next routing iteration;
S4, training the contextualized capsule network model on the training set, and performing named entity recognition on unseen biomedical text;
S5, post-processing: on the basis of the predictions of the contextualized capsule network model, all illegal labels are set to "O", which further improves entity recognition performance.
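The dynamic routing of step S3.3 can be sketched in plain Python as follows. The text does not give the form of the squash function, the rule for updating the routing logits, or the number of iterations; this example assumes the choices commonly used for capsule networks (squash scales a vector s by |s|^2 / (1 + |s|^2) along its direction, logits are increased by the dot product of each vote with v_j, and three iterations are run). All function names are hypothetical.

```python
import math


def squash(s):
    """Scale s so its length lies in [0, 1) while keeping its direction."""
    sq = sum(x * x for x in s)
    if sq == 0.0:
        return [0.0] * len(s)
    scale = sq / (1.0 + sq) / math.sqrt(sq)
    return [scale * x for x in s]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def matvec(w, u):
    return [sum(a * b for a, b in zip(row, u)) for row in w]


def dynamic_routing(main_capsules, weights, iterations=3):
    """Route main capsules u_i to entity capsules v_j.

    main_capsules -- list of vectors u_i
    weights       -- list of matrices w_j, one per entity class, shared by all i
    iterations    -- number of routing iterations (assumed, must be >= 1)
    """
    n, k = len(main_capsules), len(weights)
    dim = len(weights[0])
    # Voting vectors u_{j|i} = w_j u_i.
    votes = [[matvec(w_j, u_i) for w_j in weights] for u_i in main_capsules]
    logits = [[0.0] * k for _ in range(n)]
    for _ in range(iterations):
        coeffs = [softmax(row) for row in logits]  # c_ij, softmax over classes j
        v = []
        for j in range(k):
            # S_j = sum_i c_ij u_{j|i}, then v_j = squash(S_j).
            s_j = [sum(coeffs[i][j] * votes[i][j][d] for i in range(n))
                   for d in range(dim)]
            v.append(squash(s_j))
        # Assumed agreement update: raise the logit when a vote agrees with v_j.
        for i in range(n):
            for j in range(k):
                logits[i][j] += sum(a * b for a, b in zip(votes[i][j], v[j]))
    return v
```

The length of each returned v_j can then be read as the score of entity class j for the current word, which is the quantity the loss of step S4 operates on.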
Further, the illegal labels of step S5 are shown in Table 1, where B denotes the beginning of an entity, I denotes the inside of an entity, and O denotes a token that is not part of any named entity;
Table 1: Legal and illegal tag sequences
[Table 1: image BDA0002235710060000051 in the original publication]
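Because Table 1 is available only as an image, the exact set of illegal sequences is not known here. The sketch below therefore assumes the usual B/I/O constraint: an I label is treated as illegal when it does not directly follow a B or I label of the same entity type, and it is then set to "O" as described in step S5. The function name `fix_illegal_tags` and the type suffixes such as `-Disease` are hypothetical.

```python
def fix_illegal_tags(tags):
    """Set every illegal I label to "O" (assumed rule, see lead-in).

    tags -- predicted labels for one sentence, e.g. "B-Disease", "I-Disease", "O"
    """
    fixed = []
    prev = "O"  # a sentence is treated as starting after an O label
    for tag in tags:
        if tag.startswith("I"):
            etype = tag[1:]  # suffix such as "-Disease" (empty for plain "I")
            legal = prev[:1] in ("B", "I") and prev[1:] == etype
            if not legal:
                tag = "O"
        fixed.append(tag)
        prev = tag  # compare the next label against the corrected one
    return fixed
```

Comparing against the corrected previous label means that a run of I labels without a leading B is removed entirely; comparing against the raw prediction instead would be an equally plausible design choice.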
To demonstrate the effectiveness of the proposed contextualized capsule network, we used the F1 score as the evaluation metric on the widely used biomedical BC5CDR dataset and compared the results with current state-of-the-art methods; the proposed model obtained the best performance in both disease and chemical recognition. As shown in Table 2, the invention has clear advantages over the other methods.
Table 2: algorithm comparison results

Claims (3)

1. A biomedical entity identification method based on a contextualized capsule network, characterized by comprising the following steps:
S1, obtaining a biomedical corpus;
S2, performing data preprocessing on the acquired text;
S3, constructing a biomedical entity recognition model of the contextualized capsule network, wherein the model consists of three layers, namely a feature representation layer, a main capsule layer and an entity capsule layer:
S3.1, feature representation layer: formed by concatenating a word processor, a contextualization processor and an auxiliary processor; when entity recognition is performed on a sentence of length N, a sliding window of size W and stride 1 is first used to construct the feature representation of the text sequentially, and the concatenation of all features within the window serves as the feature input of the current word;
S3.2, main capsule layer: consists of a two-layer stacked Bi-LSTM encoder; the Bi-LSTM encoder extracts the contextual features at time step i, denoted u_i;
S3.3, entity capsule layer: the main capsule layer is routed to the higher-level entity capsule layer using dynamic routing with shared weights, as follows:
(1) the main capsule u_i is transformed by the shared weight matrix w_j to obtain the voting vector u_{j|i}; each voting vector is then assigned a weight coefficient c_{ij} by a softmax function, and the weighted sum S_j of the u_{j|i} is computed for each named entity class j;
(2) finally, the nonlinear squash function is applied to S_j to obtain the named entity class capsule v_j for the next routing iteration;
(3) the dynamic routing process is expressed by the following formulas:
u_{j|i} = w_j u_i    (1)
S_j = \sum_i c_{ij} u_{j|i}    (2)
v_j = \mathrm{squash}(S_j)    (3)
S4, training the biomedical entity recognition model of the contextualized capsule network on the training set, and performing named entity recognition on unseen biomedical text;
the biomedical entity recognition model of the contextualized capsule network is trained with the following loss function:
L_j = E_j \max(0, m^+ - \|v_j\|)^2 + \lambda (1 - E_j) \max(0, \|v_j\| - m^-)^2    (4)
where E_j = 1 when entity class j is present and E_j = 0 otherwise, and m^+, m^- and \lambda are hyper-parameters;
S5, post-processing: on the basis of the predictions of the biomedical entity recognition model of the contextualized capsule network, all illegal labels are set to "O", which further improves entity recognition performance.
2. The biomedical entity identification method based on a contextualized capsule network according to claim 1, wherein the word processor is obtained by concatenating a pre-trained word vector with word case information; the contextualization processor is a contextualized representation obtained with ELMo trained on a large unlabeled biomedical corpus; the auxiliary processor is a one-hot dictionary vector obtained from dictionary features; and finally the word processor, the contextualization processor and the auxiliary processor are concatenated to represent the features of the current word in a specific semantic space.
3. The biomedical entity identification method based on a contextualized capsule network according to claim 1 or claim 2, wherein in step S2 the biomedical documents are preprocessed, the preprocessing comprising word segmentation and number replacement; specifically: the text is split uniformly at spaces and at the characters in the set "/--><;:?[]{}()!@#$%^&*-+", which serve as segmentation points; and every number in the text (integer or floating point) is replaced with the uniform token "num".

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910982694.6A CN110807327B (en) 2019-10-16 2019-10-16 Biomedical entity identification method based on contextualized capsule network

Publications (2)

Publication Number Publication Date
CN110807327A (en) 2020-02-18
CN110807327B (en) 2022-11-08

Family

ID=69488762

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910982694.6A Active CN110807327B (en) 2019-10-16 2019-10-16 Biomedical entity identification method based on contextualized capsule network

Country Status (1)

Country Link
CN (1) CN110807327B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112348118A (en) * 2020-11-30 2021-02-09 华平信息技术股份有限公司 Image classification method based on gradient maintenance, storage medium and electronic device
CN113626567A (en) * 2021-07-28 2021-11-09 上海基绪康生物科技有限公司 Method for mining information related to genes and diseases from biomedical literature

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109977229A (en) * 2019-03-27 2019-07-05 中南大学 A kind of biomedical name entity recognition method based on all-purpose language feature
CN110019839A (en) * 2018-01-03 2019-07-16 中国科学院计算技术研究所 Medical knowledge map construction method and system based on neural network and remote supervisory
CN110083838A (en) * 2019-04-29 2019-08-02 西安交通大学 Biomedical relation extraction method based on multilayer neural network Yu external knowledge library

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李丽双等: "基于CNN-BLSTM-CRF模型的生物医学命名实体识别", 《中文信息学报》 *

Also Published As

Publication number Publication date
CN110807327B (en) 2022-11-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant