CN110807327A - Biomedical entity identification method based on contextualized capsule network - Google Patents
- Publication number
- CN110807327A
- Authority
- CN
- China
- Prior art keywords
- biomedical
- contextualized
- capsule
- entity
- capsule network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/10—Ontologies; Annotations
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Biomedical Technology (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- Bioethics (AREA)
- Databases & Information Systems (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Machine Translation (AREA)
Abstract
The invention belongs to the technical field of computer natural language processing and provides a biomedical entity identification method based on a contextualized capsule network, comprising the following steps: S1, obtaining a biomedical corpus; S2, preprocessing the acquired text; S3, constructing a contextualized capsule network model for biomedical entity recognition and training it on a training set; S4, performing named entity recognition on unseen biomedical text with the trained contextualized capsule network model; and S5, post-processing the model's predictions by setting all illegal labels to "O", which further improves entity recognition performance. The method automatically recognizes named entities in biomedical documents; compared with manual recognition it is more accurate and less time-consuming, and it generalizes better.
Description
Technical Field
The invention belongs to the technical field of computer natural language processing, and particularly relates to a biomedical entity identification method based on a contextualized capsule network.
Background knowledge
Named entity recognition is the first step in information extraction. Its task is to identify entities in documents that have a particular meaning, such as proper nouns like the names of people, places, and organizations. In the biomedical field, biomedical entity recognition refers to the automatic identification of entities such as genes, proteins, diseases, and chemicals, which helps biomedical experts extract valuable information from the vast biomedical literature. As a core task of biomedical information extraction, biomedical named entity recognition has received widespread attention from researchers. At present, the popular approaches to this task are based either on statistical machine learning or on deep learning. Statistical machine learning methods rely heavily on hand-crafted features, which are time-consuming and costly to build. In addition, their predictive performance depends on the size of the corpus, which is a challenge because annotated biomedical named entity recognition corpora are limited. Deep learning methods achieve state-of-the-art performance, but they are limited in their ability to encode richer sequence structure information, such as abbreviations, ambiguous words, and mixtures of words, punctuation, and numbers. The invention provides a biomedical entity identification method based on a contextualized capsule network, which obtains advanced experimental results on a biomedical dataset.
Disclosure of Invention
The invention aims to overcome the defects of the prior art. It exploits the ability of capsule networks to model important spatial hierarchies among complex data objects to provide a biomedical entity identification method based on a contextualized capsule network, addressing problems such as the difficulty of manual feature extraction and poor recognition performance.
The technical scheme of the invention is as follows:
A biomedical entity identification method based on a contextualized capsule network comprises the following steps:
S1, obtaining a biomedical corpus;
S2, preprocessing the acquired text;
S3, constructing a biomedical entity recognition model based on the contextualized capsule network, the model consisting of three layers: a feature representation layer, a main capsule layer, and an entity capsule layer:
S3.1, feature representation layer: formed by concatenating the word processor, the contextualization processor, and the auxiliary processor; when entity recognition is performed on a sentence of length N, a sliding window of size W and stride 1 is first used to construct the feature representation of the text word by word, and the concatenation of all features in the window serves as the feature input of the current word;
the word processor is obtained by concatenating pre-trained word vectors with word case information; the contextualization processor is a contextualized representation, based on ELMo, trained on a large unlabeled biomedical corpus; the auxiliary processor is a one-hot dictionary vector obtained from dictionary features; finally, the word processor, the contextualization processor, and the auxiliary processor are concatenated to represent the features of the current word in a specific semantic space;
S3.2, main capsule layer: consists of two stacked Bi-LSTM encoders; the Bi-LSTM encoder extracts the contextual features at time step i, denoted $u_i$;
S3.3, entity capsule layer: the main capsule layer is routed to the higher-level entity capsule layer using dynamic routing with shared weights, as follows:
(1) The main capsule $u_i$ is transformed by the weight-sharing matrix $w_j$ to obtain the voting vector $u_{j|i}$; each voting vector is then assigned a weight coefficient $c_{ij}$ through a softmax function, and the weighted sum $S_j$ of the $u_{j|i}$ corresponding to each named entity class $j$ is computed;
(2) Finally, the nonlinear squash function is applied to $S_j$ to obtain the named entity class vector $v_j$ used in the next routing iteration;
(3) The dynamic routing process is given by the following formulas:
$u_{j|i} = w_j u_i$ (1)
$S_j = \sum_i c_{ij} u_{j|i}$ (2)
$v_j = \mathrm{squash}(S_j)$ (3)
S4, training the biomedical entity recognition model of the contextualized capsule network on the training set, and performing named entity recognition on unseen biomedical text;
the biomedical entity recognition model of the contextualized capsule network is trained using the following loss function:
$L_j = E_j \max(0, m^+ - \|v_j\|)^2 + \lambda (1 - E_j) \max(0, \|v_j\| - m^-)^2$ (4)
where $E_j = 1$ when entity class $j$ is present and $E_j = 0$ otherwise, and $m^+$, $m^-$, and $\lambda$ are hyper-parameters;
S5, post-processing: on the basis of the predictions of the biomedical entity recognition model of the contextualized capsule network, all illegal labels are set to "O", which further improves entity recognition performance.
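As an illustration of step S3.3 and the loss in formula (4), the following is a minimal pure-Python sketch, not the patented implementation. The text does not spell out the squash function, the update rule for the routing logits, the number of routing iterations, or the hyper-parameter values, so the sketch assumes the standard capsule-network choices: the usual squash formula, agreement-based logit updates, three iterations, and placeholder values for `m_plus`, `m_minus`, and `lam`. All function names are hypothetical.

```python
import math


def squash(s):
    """Assumed standard squash: keeps the direction of s, maps its norm into [0, 1)."""
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0 for _ in s]
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in s]


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    top = max(z)
    exps = [math.exp(x - top) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def dynamic_routing(u, w, iterations=3):
    """Route main capsules u[i] to entity capsules v[j].

    w[j] is the matrix for class j, shared by every main capsule i.
    The iteration count and the logit update are assumptions.
    """
    votes = [[matvec(w_j, u_i) for u_i in u] for w_j in w]  # formula (1): votes[j][i]
    n_in, n_out = len(u), len(w)
    logits = [[0.0] * n_out for _ in range(n_in)]
    v = []
    for _ in range(iterations):
        c = [softmax(row) for row in logits]  # c[i][j]: softmax over classes j
        v = []
        for j in range(n_out):
            dim = len(votes[j][0])
            s_j = [sum(c[i][j] * votes[j][i][k] for i in range(n_in))
                   for k in range(dim)]  # formula (2)
            v.append(squash(s_j))  # formula (3)
        for i in range(n_in):
            for j in range(n_out):
                # Assumed update: raise the logit when the vote agrees with v[j].
                logits[i][j] += sum(a * b for a, b in zip(votes[j][i], v[j]))
    return v


def margin_loss(v, present, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Formula (4) summed over classes; present[j] plays the role of E_j (1 or 0)."""
    total = 0.0
    for v_j, e_j in zip(v, present):
        norm = math.sqrt(sum(x * x for x in v_j))
        total += (e_j * max(0.0, m_plus - norm) ** 2
                  + lam * (1 - e_j) * max(0.0, norm - m_minus) ** 2)
    return total
```

In this sketch the length of each output vector `v[j]` lies below 1 and can be read as a score for entity class `j`, which is what the margin loss pushes above `m_plus` for the present class and below `m_minus` for the others.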
The invention has the beneficial effects that:
(1) The invention provides a contextualized capsule network model that considers sequence information and spatial patterns simultaneously and shows competitive performance on complex text data, particularly entities composed of abbreviations, ambiguous words, or mixtures of words, punctuation, and numbers.
(2) Unlike current named entity recognition approaches based on sentence-level sequence labeling, the contextualized capsule network method casts the task as a context-dependent word-level classification problem.
(3) Compared with current state-of-the-art methods, the contextualized capsule network model provided by the invention achieves competitive results in disease and chemical recognition on the BC5CDR dataset. The experiments demonstrate the advantage of the contextualized capsule network for biomedical entity recognition.
(4) The biomedical entity identification method based on the contextualized capsule network automatically recognizes named entities in biomedical documents; compared with manual recognition it is more accurate and less time-consuming, and it generalizes better.
Drawings
FIG. 1 is a flow chart of biomedical entity identification of contextualized capsule networks in the present invention.
FIG. 2 is a block diagram of a contextualized capsule network model in accordance with the present invention.
Detailed Description
The technical solutions of the present invention are further described in detail with reference to specific examples so that those skilled in the art can better understand the present invention and can implement the present invention, but the examples are not intended to limit the present invention.
A biomedical entity identification method based on a contextualized capsule network, whose flow chart is shown in FIG. 1, specifically includes the following steps:
S1, obtaining a biomedical corpus;
S2, preprocessing the acquired text;
The preprocessing comprises word segmentation and number replacement. Specifically, the text is segmented at spaces and at the characters in the set "/--><;:?[]{}()!@#$%^&*-+", and the numbers in the text (integers or floating-point numbers) are replaced with the uniform token "num".
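The preprocessing described above can be sketched as follows. This is a minimal illustration under stated assumptions: the separator string is copied verbatim from the text, separator characters are kept as their own tokens (the text does not say whether they are kept or dropped), and a token counts as a number only if it consists entirely of digits with an optional decimal part. The names `tokenize` and `preprocess` are hypothetical.

```python
import re

# Separator characters copied verbatim from the description.
SEPARATOR_CHARS = "/--><;:?[]{}()!@#$%^&*-+"

# Split on whitespace (dropped) or on a single separator character
# (captured, so it is kept as a token; this is an assumption).
_SPLIT_RE = re.compile(
    r"\s+|([" + re.escape("".join(sorted(set(SEPARATOR_CHARS)))) + "])"
)

# Integers or simple floating-point numbers such as 12 or 3.5.
_NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def tokenize(text):
    """Segment text at spaces and at the separator characters."""
    return [piece for piece in _SPLIT_RE.split(text) if piece]


def preprocess(text):
    """Tokenize, then replace every numeric token with the uniform token 'num'."""
    return ["num" if _NUMBER_RE.fullmatch(tok) else tok for tok in tokenize(text)]
```

For example, `preprocess("IL-2 (3.5 mg/kg)")` yields `["IL", "-", "num", "(", "num", "mg", "/", "kg", ")"]` under these assumptions.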
S3, constructing a biomedical entity recognition model based on the contextualized capsule network, the model consisting of three layers: a feature representation layer, a main capsule layer, and an entity capsule layer; the specific structure is shown in FIG. 2;
S3.1, feature representation layer: formed by concatenating the word processor, the contextualization processor, and the auxiliary processor; when entity recognition is performed on a sentence of length N, a sliding window of size W and stride 1, with the target word at the center of the window, is first used to construct the feature representation of the text word by word, and the concatenation of all features in the window serves as the feature input of the current word;
S3.2, main capsule layer: consists of two stacked Bi-LSTM encoders; the Bi-LSTM encoder extracts the contextual features at time step i, denoted $u_i$;
S3.3, entity capsule layer: the main capsule layer is routed to the higher-level entity capsule layer using dynamic routing with shared weights, as follows:
the main capsule $u_i$ is transformed by the weight-sharing matrix $w_j$ to obtain the voting vector $u_{j|i}$; each voting vector is then assigned a weight coefficient $c_{ij}$ through a softmax function, and the weighted sum $S_j$ of the $u_{j|i}$ corresponding to each named entity class $j$ is computed; finally, the nonlinear squash function is applied to $S_j$ to obtain the named entity class vector $v_j$ used in the next routing iteration;
S4, training the contextualized capsule network model on the training set, and performing named entity recognition on unseen biomedical text;
S5, post-processing: on the basis of the predictions of the contextualized capsule network model, all illegal labels are set to "O", which further improves entity recognition performance.
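The sliding-window feature construction of step S3.1 can be sketched as follows. This is an illustrative sketch only: it assumes that W is odd so the target word sits exactly at the center, and that positions beyond the sentence boundary are filled with zero vectors, neither of which is stated in the text. The function name `window_features` is hypothetical, and each per-word vector stands for the already concatenated word, contextualization, and auxiliary features.

```python
def window_features(word_vectors, window=3):
    """Build one input vector per word by concatenating the feature vectors
    of all words in a window centered on it (stride 1).

    Assumptions: the window size is odd, and out-of-sentence positions
    are padded with zero vectors.
    """
    if window % 2 == 0:
        raise ValueError("window size must be odd so the target word is centered")
    if not word_vectors:
        return []
    dim = len(word_vectors[0])
    half = window // 2
    pad = [0.0] * dim
    padded = [pad] * half + list(word_vectors) + [pad] * half
    features = []
    for i in range(len(word_vectors)):
        concatenated = []
        for vec in padded[i:i + window]:
            concatenated.extend(vec)
        features.append(concatenated)
    return features
```

A sentence of N words therefore produces N input vectors, each of length W times the per-word feature dimension.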
Further, the illegal tags of step S5 are shown in Table 1, where B denotes the beginning of an entity, I denotes the inside of an entity, and O denotes a token that is not part of any named entity;
Table 1: Legal and illegal tag sequences
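The post-processing of step S5 can be sketched as follows. The contents of Table 1 are not reproduced in this text, so the rule used here is an assumption: an I tag is treated as illegal when it opens a sentence or follows an O, because an entity must begin with B. The check runs on the already corrected previous tag, so a whole dangling run of I tags is reset to O. The function name `fix_illegal_tags` is hypothetical.

```python
def fix_illegal_tags(tags):
    """Set illegal tags to 'O'.

    Assumed rule (Table 1 is not available here): an 'I' that starts the
    sequence or follows an 'O' is illegal, since entities must start with 'B'.
    """
    fixed = []
    previous = "O"  # treat the position before the sentence as outside any entity
    for tag in tags:
        if tag == "I" and previous == "O":
            tag = "O"
        fixed.append(tag)
        previous = tag
    return fixed
```

For example, `fix_illegal_tags(["O", "I", "I", "B", "I", "O"])` returns `["O", "O", "O", "B", "I", "O"]` under this rule.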
To demonstrate the effectiveness of the contextualized capsule network proposed by the invention, we evaluated it with the F1 score on the widely used biomedical BC5CDR dataset and compared it with current state-of-the-art methods; it obtained the best performance in both disease and chemical recognition. As shown in Table 2, the invention has significant advantages over the other methods.
Table 2: Algorithm comparison results
Claims (3)
1. A biomedical entity identification method based on a contextualized capsule network, characterized by comprising the following steps:
S1, obtaining a biomedical corpus;
S2, preprocessing the acquired text;
S3, constructing a biomedical entity recognition model based on the contextualized capsule network, the model consisting of three layers: a feature representation layer, a main capsule layer, and an entity capsule layer:
S3.1, feature representation layer: formed by concatenating the word processor, the contextualization processor, and the auxiliary processor; when entity recognition is performed on a sentence of length N, a sliding window of size W and stride 1 is first used to construct the feature representation of the text word by word, and the concatenation of all features in the window serves as the feature input of the current word;
S3.2, main capsule layer: consists of two stacked Bi-LSTM encoders; the Bi-LSTM encoder extracts the contextual features at time step i, denoted $u_i$;
S3.3, entity capsule layer: the main capsule layer is routed to the higher-level entity capsule layer using dynamic routing with shared weights, as follows:
(1) the main capsule $u_i$ is transformed by the weight-sharing matrix $w_j$ to obtain the voting vector $u_{j|i}$; each voting vector is then assigned a weight coefficient $c_{ij}$ through a softmax function, and the weighted sum $S_j$ of the $u_{j|i}$ corresponding to each named entity class $j$ is computed;
(2) finally, the nonlinear squash function is applied to $S_j$ to obtain the named entity class vector $v_j$ used in the next routing iteration;
(3) the dynamic routing process is given by the following formulas:
$u_{j|i} = w_j u_i$ (1)
$S_j = \sum_i c_{ij} u_{j|i}$ (2)
$v_j = \mathrm{squash}(S_j)$ (3)
S4, training the biomedical entity recognition model of the contextualized capsule network on the training set, and performing named entity recognition on unseen biomedical text;
the biomedical entity recognition model of the contextualized capsule network is trained using the following loss function:
$L_j = E_j \max(0, m^+ - \|v_j\|)^2 + \lambda (1 - E_j) \max(0, \|v_j\| - m^-)^2$ (4)
where $E_j = 1$ when entity class $j$ is present and $E_j = 0$ otherwise, and $m^+$, $m^-$, and $\lambda$ are hyper-parameters;
S5, post-processing: on the basis of the predictions of the biomedical entity recognition model of the contextualized capsule network, all illegal labels are set to "O", which further improves entity recognition performance.
2. The biomedical entity identification method based on a contextualized capsule network of claim 1, wherein the word processor is obtained by concatenating pre-trained word vectors with word case information; the contextualization processor is a contextualized representation, based on ELMo, trained on a large unlabeled biomedical corpus; the auxiliary processor is a one-hot dictionary vector obtained from dictionary features; and finally the word processor, the contextualization processor, and the auxiliary processor are concatenated to represent the features of the current word in a specific semantic space.
3. The biomedical entity identification method based on a contextualized capsule network of claim 1 or claim 2, wherein in step S2 the biomedical documents are preprocessed, the preprocessing comprising word segmentation and number replacement; specifically, the text is segmented at spaces and at the characters in the set "/--><;:?[]{}()!@#$%^&*-+", and the numbers in the text (integers or floating-point numbers) are replaced with the uniform token "num".
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910982694.6A CN110807327B (en) | 2019-10-16 | 2019-10-16 | Biomedical entity identification method based on contextualized capsule network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910982694.6A CN110807327B (en) | 2019-10-16 | 2019-10-16 | Biomedical entity identification method based on contextualized capsule network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110807327A (en) | 2020-02-18 |
CN110807327B (en) | 2022-11-08 |
Family
ID=69488762
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910982694.6A (Active, CN110807327B) | Biomedical entity identification method based on contextualized capsule network | 2019-10-16 | 2019-10-16 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110807327B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112348118A (en) * | 2020-11-30 | 2021-02-09 | 华平信息技术股份有限公司 | Image classification method based on gradient maintenance, storage medium and electronic device |
CN113626567A (en) * | 2021-07-28 | 2021-11-09 | 上海基绪康生物科技有限公司 | Method for mining information related to genes and diseases from biomedical literature |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109977229A (en) * | 2019-03-27 | 2019-07-05 | 中南大学 | A kind of biomedical name entity recognition method based on all-purpose language feature |
CN110019839A (en) * | 2018-01-03 | 2019-07-16 | 中国科学院计算技术研究所 | Medical knowledge map construction method and system based on neural network and remote supervisory |
CN110083838A (en) * | 2019-04-29 | 2019-08-02 | 西安交通大学 | Biomedical relation extraction method based on multilayer neural network Yu external knowledge library |
- 2019-10-16: CN application CN201910982694.6A, patent CN110807327B (en), status Active
Non-Patent Citations (1)
Title |
---|
李丽双等: "基于CNN-BLSTM-CRF模型的生物医学命名实体识别", 《中文信息学报》 * |
Also Published As
Publication number | Publication date |
---|---|
CN110807327B (en) | 2022-11-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109902145B (en) | Attention mechanism-based entity relationship joint extraction method and system | |
CN110222188B (en) | Company notice processing method for multi-task learning and server | |
CN108984526B (en) | Document theme vector extraction method based on deep learning | |
CN108897989B (en) | Biological event extraction method based on candidate event element attention mechanism | |
CN111738007B (en) | Chinese named entity identification data enhancement algorithm based on sequence generation countermeasure network | |
CN112487820B (en) | Chinese medical named entity recognition method | |
Puigcerver et al. | ICDAR2015 competition on keyword spotting for handwritten documents | |
CN110879831A (en) | Chinese medicine sentence word segmentation method based on entity recognition technology | |
CN113806494B (en) | Named entity recognition method based on pre-training language model | |
CN113761890B (en) | Multi-level semantic information retrieval method based on BERT context awareness | |
CN113190656A (en) | Chinese named entity extraction method based on multi-label framework and fusion features | |
CN111967267B (en) | XLNET-based news text region extraction method and system | |
CN115186665B (en) | Semantic-based unsupervised academic keyword extraction method and equipment | |
CN111222318A (en) | Trigger word recognition method based on two-channel bidirectional LSTM-CRF network | |
CN112084435A (en) | Search ranking model training method and device and search ranking method and device | |
CN110807327B (en) | Biomedical entity identification method based on contextualized capsule network | |
CN114818717A (en) | Chinese named entity recognition method and system fusing vocabulary and syntax information | |
Li et al. | Adapting clip for phrase localization without further training | |
CN111651985A (en) | Method and device for Chinese word segmentation | |
CN115935998A (en) | Multi-feature financial field named entity identification method | |
CN114118092A (en) | Quick-start interactive relation labeling and extracting framework | |
CN112015903B (en) | Question duplication judging method and device, storage medium and computer equipment | |
CN112634878A (en) | Speech recognition post-processing method and system and related equipment | |
CN114969343B (en) | Weak supervision text classification method combined with relative position information | |
CN113792550B (en) | Method and device for determining predicted answers, reading and understanding method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||