CN110807327A

CN110807327A - Biomedical entity identification method based on contextualized capsule network

Info

Publication number: CN110807327A
Application number: CN201910982694.6A
Authority: CN
Inventors: 陈鹏; 徐博; 夏锋; 王悦
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2019-10-16
Filing date: 2019-10-16
Publication date: 2020-02-18
Anticipated expiration: 2039-10-16
Also published as: CN110807327B

Abstract

The invention belongs to the technical field of natural language processing and provides a biomedical entity identification method based on a contextualized capsule network, comprising the following steps: S1, obtaining a biomedical corpus; S2, performing data preprocessing on the acquired texts; S3, constructing a biomedical entity recognition model of the contextualized capsule network and training it on a training set; S4, performing named entity recognition on unseen biomedical text using the trained contextualized capsule network model; and S5, performing a post-processing operation in which all illegal labels in the prediction of the contextualized capsule network model are set to O, further improving entity recognition performance. The method realizes automatic recognition of named entities in biomedical documents; compared with manual recognition it has higher recognition accuracy and lower time overhead, and it has stronger generalization capability.

Description

Biomedical entity identification method based on contextualized capsule network

Technical Field

The invention belongs to the technical field of computer natural language processing, and particularly relates to a biomedical entity identification method based on a contextualized capsule network.

Background knowledge

Named entity recognition is the first step in information extraction. The task is to identify entities in documents that have a particular meaning, such as proper nouns like the names of people, places, and organizations. In the biomedical field, biomedical entity identification refers to the automatic identification of entities such as genes, proteins, diseases, and chemicals, to assist biomedical experts in extracting valuable information from the vast biomedical literature. As a core task of biomedical information extraction, biomedical named entity recognition has received widespread attention from researchers. At present, the popular approaches to the task are methods based on statistical machine learning and methods based on deep learning. Statistical machine learning methods rely heavily on hand-crafted features, which are time-consuming and costly to produce. In addition, the size of the corpus affects the predictive performance of these methods, which is a challenge because annotated corpora for biomedical named entity recognition are limited. Deep learning methods achieve state-of-the-art performance, but they are limited in their ability to encode richer sequence structure information, such as abbreviations, ambiguous words, or mixtures of words, punctuation, and numbers. The invention provides a biomedical entity identification method based on a contextualized capsule network, which obtains competitive experimental results on a biomedical dataset.

Disclosure of Invention

The invention aims to overcome the defects of the prior art. It provides a biomedical entity identification method based on a contextualized capsule network that exploits the capsule network's ability to model important spatial hierarchies among complex data objects, and it addresses problems such as the difficulty of manual feature extraction and poor recognition performance.

The technical scheme of the invention is as follows:

A biomedical entity identification method based on a contextualized capsule network comprises the following steps:

S1, obtaining a biomedical corpus;

S2, performing data preprocessing on the acquired texts;

S3, constructing a biomedical entity recognition model of the contextualized capsule network, wherein the model is composed of three layers: a feature representation layer, a main capsule layer and an entity capsule layer:

S3.1 feature representation layer: formed by concatenating the word processor, the contextualization processor and the auxiliary processor; when entity recognition is performed on a sentence of length N, a sliding window of size W and stride 1 is first used to construct the feature representation of the text sequentially, and the concatenation of all features in the window is used as the feature input of the current word;

the word processor is obtained by concatenating pre-trained word vectors with word case information; the contextualization processor is a contextualized representation obtained by training ELMo on a large unlabeled biomedical corpus; the auxiliary processor is a one-hot dictionary vector obtained from dictionary features; finally, the word processor, the contextualization processor and the auxiliary processor are concatenated to represent the features of the current word in a specific semantic space;

S3.2 main capsule layer: consists of two stacked Bi-LSTM encoders; the Bi-LSTM encoder extracts the contextual features at time step i, denoted $u_i$;

S3.3 entity capsule layer: the main capsule layer is routed to the higher-level entity capsule layer using dynamic routing with shared weights, as follows:

(1) The main capsule $u_i$ is transformed by the weight-sharing matrix $w_j$ to obtain the voting vector $u_{j|i}$; each voting vector is then assigned a weight coefficient $c_{ij}$ through a softmax function, and the weighted sum $S_j$ of the $u_{j|i}$ is computed for each named entity class $j$;

(2) Finally, the nonlinear squash function is applied to $S_j$ to obtain the output $v_j$ for named entity class $j$, which is used in the next routing iteration;

(3) The formulas of the above dynamic routing process are as follows:

$$u_{j|i} = w_j u_i \tag{1}$$

$$S_j = \sum_i c_{ij} u_{j|i} \tag{2}$$

$$v_j = \operatorname{squash}(S_j) \tag{3}$$
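The routing steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the squash formula and the agreement update of the routing logits follow the standard dynamic routing formulation from the capsule network literature, the softmax is taken over entity classes, and the function names, array shapes and the iteration count of 3 are hypothetical.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Standard capsule squash (assumed): keeps direction, maps the norm into [0, 1)."""
    sq_norm = np.sum(s * s, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def route(u, w, num_iters=3):
    """u: (I, d_in) main capsules; w: (J, d_out, d_in) weights shared across all i.
    Returns v: (J, d_out), one output capsule per named entity class."""
    # Eq. (1): voting vectors u_hat[i, j] = w[j] @ u[i]
    u_hat = np.einsum("jkd,id->ijk", w, u)
    b = np.zeros(u_hat.shape[:2])  # routing logits, so the first pass is uniform
    for _ in range(num_iters):
        # Weight coefficients c_ij: softmax over classes j for each capsule i
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)
        # Weighted sum S_j of the voting vectors for each class j
        s = np.einsum("ij,ijk->jk", c, u_hat)
        # Eq. (3): v_j = squash(S_j)
        v = squash(s)
        # Assumed agreement update: raise the logit where a vote agrees with v_j
        b = b + np.einsum("ijk,jk->ij", u_hat, v)
    return v
```

The length of each returned vector lies in [0, 1) and can be read as a score for the corresponding entity class, which is what the margin-based loss in step S4 operates on.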

S4, training the biomedical entity recognition model of the contextualized capsule network on the training set, and performing named entity recognition on unseen biomedical text;

training a biomedical entity recognition model of the contextualized capsule network using a loss function as follows:

$$L_j = E_j \max(0, m^+ - \|v_j\|)^2 + \lambda (1 - E_j) \max(0, \|v_j\| - m^-)^2 \tag{4}$$

where $E_j = 1$ when entity class $j$ is present and $E_j = 0$ otherwise, and $m^+$, $m^-$ and $\lambda$ are hyper-parameters;
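As a hedged illustration of formula (4), the loss can be written in a few lines of plain Python. The function name, the summation over classes, and the default values 0.9, 0.1 and 0.5 for the hyper-parameters are assumptions (they are common choices in the capsule network literature); the source does not state the values actually used.

```python
import math

def margin_loss(capsules, present, m_plus=0.9, m_minus=0.1, lam=0.5):
    """capsules: list of class vectors v_j; present: list of E_j values (1 or 0).
    Returns the sum of L_j over all classes (summing is an assumption)."""
    total = 0.0
    for v_j, e_j in zip(capsules, present):
        norm = math.sqrt(sum(x * x for x in v_j))  # ||v_j||
        # Present class: penalize a norm below m_plus
        positive = e_j * max(0.0, m_plus - norm) ** 2
        # Absent class: penalize a norm above m_minus, down-weighted by lam
        negative = lam * (1 - e_j) * max(0.0, norm - m_minus) ** 2
        total += positive + negative
    return total
```

A present class with a long capsule vector and an absent class with a short one both contribute zero loss, so training pushes the vector length toward a presence score.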

S5, post-processing operation: on the basis of the prediction of the biomedical entity recognition model of the contextualized capsule network, all illegal labels are set to 'O', further improving entity recognition performance.

The invention has the beneficial effects that:

(1) the invention provides a contextualized capsule network model which simultaneously considers sequence information and spatial patterns and exhibits competitive performance in processing complex text data, particularly entities composed of abbreviations, ambiguous words or a mixture of words, punctuation and numbers.

(2) Unlike current named entity recognition methods based on sentence-level sequence labeling, the contextualized capsule network approach casts the task as a context-dependent word-level classification problem.

(3) Compared with current state-of-the-art methods, the contextualized capsule network model provided by the invention achieves competitive results on disease and chemical recognition on the BC5CDR dataset. The experiments demonstrate the advantages of the contextualized capsule network for biomedical entity recognition.

(4) The biomedical entity recognition method based on the contextualized capsule network, provided by the invention, realizes automatic recognition of named entities in biomedical documents, has higher recognition accuracy and less time overhead compared with a manual recognition mode, and has stronger generalization capability.

Drawings

FIG. 1 is a flow chart of biomedical entity identification of contextualized capsule networks in the present invention.

FIG. 2 is a block diagram of a contextualized capsule network model in accordance with the present invention.

Detailed Description

The technical solutions of the present invention are further described in detail with reference to specific examples so that those skilled in the art can better understand the present invention and can implement the present invention, but the examples are not intended to limit the present invention.

A biomedical entity identification method based on a contextualized capsule network, whose flow chart is shown in FIG. 1, specifically includes the following steps:

S1, obtaining a biomedical corpus;

S2, performing data preprocessing on the acquired texts;

The preprocessing operation comprises word segmentation and number replacement. Specifically, the text is segmented into words using spaces and the characters in the set "/--><；:？[]{}()！@#$％^&*-+" as segmentation points, and every number (integer or floating point) in the text is replaced with the uniform identifier "num".
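A minimal sketch of this preprocessing, using only the Python standard library, might look like the following. The delimiter string is copied from the text as printed; keeping each delimiter as its own token (rather than discarding it) and the exact number pattern are assumptions, and the function name is hypothetical.

```python
import re

# Delimiter characters copied from the description as printed.
DELIMS = "/--><；:？[]{}()！@#$％^&*-+"

# Split on any delimiter (captured, so it is kept as a token) or on whitespace.
_SPLIT = re.compile("([" + re.escape(DELIMS) + "])|\\s+")
# Integers and simple floating point numbers such as 5 or 5.5.
_NUMBER = re.compile(r"\d+(?:\.\d+)?")

def preprocess(text):
    """Segment text into tokens and replace numeric tokens with "num"."""
    # re.split yields None for the whitespace branch and "" between
    # adjacent separators; both are filtered out here.
    tokens = [t for t in _SPLIT.split(text) if t]
    return ["num" if _NUMBER.fullmatch(t) else t for t in tokens]
```

For example, `preprocess("IL-2 (5.5 mg)")` yields `['IL', '-', 'num', '(', 'num', 'mg', ')']` under these assumptions.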

S3, constructing a biomedical entity recognition model of the contextualized capsule network, wherein the model is composed of three layers: a feature representation layer, a main capsule layer and an entity capsule layer; the specific structure is shown in FIG. 2;

S3.1 feature representation layer: formed by concatenating the word processor, the contextualization processor and the auxiliary processor; when entity recognition is performed on a sentence of length N, a sliding window of size W and stride 1, with the target word at the center of the window, is first used to construct the feature representation of the text sequentially, and the concatenation of all features in the window is used as the feature input of the current word;
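The feature construction in S3.1 can be sketched with NumPy as below. This is an illustrative sketch only: the function names and array shapes are hypothetical, the window size is assumed to be odd so that the target word sits at the center, and zero padding at sentence boundaries is an assumption because the source does not say how boundary words are handled.

```python
import numpy as np

def token_features(word_vecs, case_feats, context_vecs, dict_onehots):
    """Concatenate the per-word features; each input has shape (N, its own width)."""
    return np.concatenate([word_vecs, case_feats, context_vecs, dict_onehots], axis=1)

def window_features(token_feats, window):
    """token_feats: (N, D) array, one row per word; window: odd window size W.
    Returns an (N, W * D) array: row i holds the features of the W words
    centered on word i, concatenated in order."""
    n, d = token_feats.shape
    half = window // 2
    # Assumed boundary handling: pad with zero vectors on both sides.
    pad = np.zeros((half, d))
    padded = np.vstack([pad, token_feats, pad])
    # Stride 1: one window per word in the sentence.
    return np.stack([padded[i:i + window].reshape(-1) for i in range(n)])
```

With N words, D features per word and window size W, every word receives an input of width W * D, so the downstream layers see a fixed-size input regardless of sentence length.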

The main capsule $u_i$ is transformed by the weight-sharing matrix $w_j$ to obtain the voting vector $u_{j|i}$; each voting vector is then assigned a weight coefficient $c_{ij}$ through a softmax function, and the weighted sum $S_j$ of the $u_{j|i}$ is computed for each named entity class $j$; finally, the nonlinear squash function is applied to $S_j$ to obtain the output $v_j$ for named entity class $j$, which is used in the next routing iteration;

S4, training the contextualized capsule network model on the training set, and carrying out named entity recognition on the unknown biomedical text;

S5, performing a post-processing operation: on the basis of the prediction of the contextualized capsule network model, all illegal labels are set to O, further improving entity recognition performance.

Further, the illegal labels referred to in step S5 are listed in Table 1, where B denotes the beginning of an entity, I denotes the inside of an entity, and O denotes a token that is not part of any named entity;

Table 1: Legal and illegal tag sequences
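The post-processing of step S5 can be illustrated with a short Python sketch. Because the body of Table 1 is not reproduced here, the rule used below is an assumption: an I tag is treated as illegal when it does not directly follow a B or I tag of the same entity type. The `B-Type` / `I-Type` label format and the function name are also hypothetical.

```python
def fix_illegal_labels(labels):
    """Return a copy of labels in which every illegal I- tag is set to "O".

    Assumed rule: "I-X" is legal only directly after "B-X" or "I-X".
    """
    fixed = []
    previous = "O"  # treat the position before the sentence as outside any entity
    for label in labels:
        if label.startswith("I-"):
            kind = label[2:]
            if previous not in ("B-" + kind, "I-" + kind):
                label = "O"
        fixed.append(label)
        # Compare against the corrected label, so a run of orphan I- tags
        # is cleared as a whole.
        previous = label
    return fixed
```

For instance, the sequence `O, I-Disease, B-Chemical, I-Chemical` becomes `O, O, B-Chemical, I-Chemical` under this rule.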

To demonstrate the effectiveness of the contextualized capsule network proposed by the present invention, the model was evaluated on the commonly used biomedical BC5CDR dataset with F1 as the evaluation metric and compared with current state-of-the-art methods; it obtained the best performance in both disease and chemical recognition. As shown in Table 2, the present invention has significant advantages over the other methods.

Table 2: algorithm comparison results

Claims

1. A biomedical entity identification method based on contextualized capsule networks is characterized by comprising the following steps:

S1, obtaining a biomedical corpus;

S2, performing data preprocessing on the acquired texts;

(3) The formulas of the above dynamic routing process are as follows:

$$u_{j|i} = w_j u_i \tag{1}$$

$$v_j = \operatorname{squash}(S_j) \tag{3}$$

$$L_j = E_j \max(0, m^+ - \|v_j\|)^2 + \lambda (1 - E_j) \max(0, \|v_j\| - m^-)^2 \tag{4}$$

2. The method for biomedical entity identification based on a contextualized capsule network of claim 1, wherein the word processor is obtained by concatenating pre-trained word vectors with word case information; the contextualization processor is a contextualized representation obtained by training ELMo on a large unlabeled biomedical corpus; the auxiliary processor is a one-hot dictionary vector obtained from dictionary features; finally, the word processor, the contextualization processor and the auxiliary processor are concatenated to represent the features of the current word in a specific semantic space.

3. The method for biomedical entity identification based on a contextualized capsule network of claim 1 or claim 2, wherein in step S2 the biomedical documents are preprocessed, the preprocessing comprising word segmentation and number replacement; specifically, the text is segmented into words using spaces and the characters in the set "/--><；:？[]{}()！@#$％^&*-+" as segmentation points, and every number (integer or floating point) in the text is replaced with the uniform identifier "num".