CN108182976A - A kind of clinical medicine information extracting method based on neural network - Google Patents

A kind of clinical medicine information extracting method based on neural network Download PDF

Info

Publication number
CN108182976A
CN108182976A (Application CN201711462492.6A)
Authority
CN
China
Prior art keywords
vector
word
max
character
lstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711462492.6A
Other languages
Chinese (zh)
Inventor
李辰
王轩
龙雨
李质婧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN201711462492.6A priority Critical patent/CN108182976A/en
Publication of CN108182976A publication Critical patent/CN108182976A/en
Pending legal-status Critical Current

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a clinical medical information extraction method based on neural networks. Medical text typically contains many professional terms, rare words, and temporal expressions composed of digits and characters, but the character vectors obtained with a convolutional neural network capture the morphological information of words and can therefore handle such cases well. At the same time, the bidirectional LSTM used here captures contextual information well. In addition, the neural-network approach avoids the manual feature design required in conventional machine learning and can thus solve the domain-adaptation problem. The method of the present invention achieves good results on data from different domains and can efficiently, accurately, and intelligently extract information of practical value and research significance from massive medical data.

Description

A kind of clinical medicine information extracting method based on neural network
Technical field
The present invention relates to the field of natural language processing for biomedical text, and in particular to a clinical medical information extraction method based on neural networks.
Background art
Against the background of big data and "smart healthcare", text mining and information extraction for the medical domain have become a focus of researchers in recent years. Extracting medical entity information, such as times and events, from medical text is one of the vital tasks of medical big-data processing. However, unstructured medical text stated in natural language is enormous in volume, complex, and generated rapidly, so it is extremely difficult for researchers to obtain valuable knowledge and information from large amounts of text quickly and accurately. It is therefore urgent to extract medical knowledge of practical value and research significance from massive medical data efficiently, accurately, and intelligently, to represent it in a structured form, and thereby to grasp more deeply the unknown disease information that threatens human health.
Existing information extraction methods fall broadly into rule-based extraction and machine learning. Owing to the complexity of natural language, hand-built rules are difficult to make cover all entity types. For supervised machine-learning algorithms, the particularity and complexity of medical text mean that the entity information of different diseases overlaps very little and the entity classes are highly diverse, so for each disease a portion of the text must be manually labeled in advance as training data, and manually constructed features can hardly cover the characteristics of all entity classes. When a problem arises in a new domain, the model can only be rebuilt by relabeling and retraining. Annotating medical data, however, demands a great deal of highly professional time, so the cost is extremely high.
Summary of the invention
The object of the present invention is to overcome the above problems of the prior art and to provide a clinical medical information extraction method based on neural networks.
In order to achieve the above object, the present invention adopts the following technical scheme that:
Step 1: First perform word segmentation on the training text and the test text, and annotate the segmented training text with BIO labels.
Step 2: Build an initial character vector table for the 24 English letters and other common characters, and build initial word vectors using the biomedical articles in the PubMed database as the corpus. For the text segmented in Step 1, obtain the initial word vector of each word and the initial character vector of each character by table lookup.
Step 3: Build a neural-network medical entity extraction model that takes the character vectors and word vectors generated in Step 2 as joint input. The model is divided into three parts: an encoder, a decoder and a classifier. A CNN network and a Bi-LSTM network encode the character-vector and word-vector inputs respectively, a Bi-LSTM network decodes, and a softmax classifier completes the classification.
Step 4: Train the above model with the BIO-labeled training data; by comparing the actual BIO labels in the training data with the BIO labels output by the model, adjust the model parameters to optimize classification performance.
Step 5: Test the model trained in Step 4 on the test data; the BIO label sequence finally output by the softmax classifier yields the extracted medical entities.
Step 2 comprises the following sub-steps:
Step 2.1: Initialize a character vector for each of the English characters using random numbers. Specifically, each dimension of the initial vector is assigned a number generated at random within a fixed range that depends on dim (the exact bounds are given by a formula in the original filing), where dim is the dimension of the character vector. All initial character vectors are collected into an initial character vector table; the size of dim is between 30 and 50.
Step 2.2: For every character in the training text and the test text, obtain its initial character vector by looking it up in the initial character vector table generated in Step 2.1.
Step 2.3: Using the GloVe word-vector model published by Stanford, generate an initial word vector table with the biomedical articles in the PubMed database as the corpus.
Step 2.4: For every word in the training text and the test text, obtain its initial word vector by looking it up in the initial word vector table generated in Step 2.3.
Step 3 comprises the following sub-steps:
Step 3.1: Using the initial character vectors generated in Step 2.2, splice together the initial character vectors of the characters composing each word into an initial character matrix and feed it into the convolutional neural network for encoding, one word at a time. Each initial character matrix input to the convolutional neural network first passes through a convolutional layer, in which a convolution kernel convolves the initial character vectors of adjacent characters within the word. The matrix output by the convolutional layer is then input to a max-pooling layer: for each row vector of the output matrix, max pooling selects the largest-valued dimension to represent the information of the entire row, so that the max-pooling layer outputs a single vector with the same dimension as the initial character vectors.
Step 3.2: Using the initial word vectors generated in Step 2.3, splice together the initial word vectors of all words in each sentence and feed them into a Bi-LSTM for encoding. The bidirectional LSTM contains two LSTM layers, one forward and one backward. For the t-th word in a sentence, the forward LSTM yields the vector hft, which contains the context information from the first word to the t-th word, and the backward LSTM yields the vector hbt, which contains the context information from the t-th word to the last word. The two vectors are spliced together as the word's encoding ht = (hft, hbt).
Step 3.3: Let the character vector of each word i output by the CNN layer be {c1, c2, ..., cdim} and the word vector output by the Bi-LSTM encoding layer be {wh1, wh2, ..., whn}. Both are normalized: let cmax be the largest-valued dimension of the character vector and whmax the largest-valued dimension of the word vector; the final character vector is then {c1/cmax, c2/cmax, ..., cdim/cmax} and the final word vector is {wh1/whmax, wh2/whmax, ..., whn/whmax}. The two are spliced into the final vector of each word, mi = {c1/cmax, c2/cmax, ..., cdim/cmax, wh1/whmax, wh2/whmax, ..., whn/whmax}. The final vectors of all words in each sentence are concatenated into a final vector matrix, which is then input, sentence by sentence, to a Bi-LSTM network for decoding.
Step 3.4: The vectors output by the Bi-LSTM decoder pass through a final softmax layer, which yields the final BIO label result for each word.
Compared with the prior art, the present invention has the following beneficial technical effects:
The method of the present invention, based on neural networks, extracts entity information relevant to clinical medicine, such as time and event expressions, from unstructured medical text. Medical text typically contains many professional terms, rare words, and temporal expressions composed of digits and characters, which are difficult to extract accurately with hand-designed morphological and syntactic features. The character vectors obtained with a convolutional neural network, however, capture the morphological information of words and therefore handle such cases well, while the bidirectional LSTM captures contextual information. In addition, the neural-network approach avoids the manual feature design required in conventional machine learning and can thus solve the domain-adaptation problem. The method achieves good results on data from different domains and can efficiently, accurately, and intelligently extract information of practical value and research significance from massive medical data.
Description of the drawings
Fig. 1 is a flow chart of the clinical medical information extraction method of the present invention.
Specific embodiment
The present invention is described in further detail below with reference to test examples and specific embodiments. This should not be understood as limiting the scope of the above subject matter of the present invention to the following embodiments; any technique realized on the basis of the content of the present invention falls within the scope of the invention.
The training data and test data used in the embodiments are the THYME corpus released for SemEval-2017.
Embodiment 1: mainly used to extract temporal sequence information from clinical medical text. Both the training data and the test data are the colon cancer text data in the THYME corpus.
Step 1: First perform word segmentation on the training text and the test text, and annotate the segmented training text with BIO labels, where B indicates that the word begins an information sequence, I indicates that the word is inside an information sequence, and O indicates that the word does not belong to any information sequence.
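Once a BIO label sequence is produced, the labeled spans can be read off with a single scan. A minimal illustration (the function name and the example tokens are hypothetical, not from the patent):

```python
def bio_spans(tokens, tags):
    """Collect the token spans marked by a BIO tag sequence.
    A span opens at a 'B' tag and extends over the following 'I' tags."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":              # a new information sequence begins
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "O":            # outside any information sequence
            if start is not None:
                spans.append((start, i))
            start = None
        # tag == "I": the current span simply continues
    if start is not None:
        spans.append((start, len(tags)))
    return [" ".join(tokens[s:e]) for s, e in spans]

tokens = ["discharged", "after", "a", "few", "days", "of", "rest"]
tags   = ["O", "O", "B", "I", "I", "O", "O"]
print(bio_spans(tokens, tags))  # ['a few days']
```

The same scan works unchanged whether the spans are temporal expressions (Embodiment 1) or events (Embodiment 2), since only the label scheme matters.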
Step 2: Build an initial character vector table for the 24 English letters and other common characters, and build initial word vectors using the biomedical articles in the PubMed database as the corpus. For the text segmented in Step 1, obtain the initial word vector of each word and the initial character vector of each character by table lookup.
The detailed sub-steps are as follows:
Step 2.1: Initialize a character vector for each of the English characters using random numbers. Specifically, each dimension of the initial vector is assigned a number generated at random within a fixed range that depends on dim (the exact bounds are given by a formula in the original filing), where dim is the dimension of the character vector; all initial character vectors are collected into an initial character vector table. Here the size of dim is set to 30. If x is the i-th character of the initial character vector table, the initial character vector of x is {cpi1, cpi2, ..., cpidim}.
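The random initialization of Step 2.1 can be sketched in Python. The exact sampling range appears only as a formula image in the source; the bound sqrt(3/dim), a common choice for character embeddings, is assumed here and should be treated as an assumption, not the patent's exact value:

```python
import math
import random

def init_char_table(chars, dim=30, seed=0):
    """Build an initial character vector table: each of the dim dimensions
    is drawn uniformly at random. The bound sqrt(3/dim) is an assumed
    choice; the patent gives the range only as a formula."""
    rng = random.Random(seed)
    bound = math.sqrt(3.0 / dim)
    return {c: [rng.uniform(-bound, bound) for _ in range(dim)] for c in chars}

# toy character set: lowercase letters, digits, and a hyphen
table = init_char_table("abcdefghijklmnopqrstuvwxyz0123456789-", dim=30)
print(len(table["a"]))  # 30, the dimension set in this embodiment
```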
Step 2.2: For every character in the training text and the test text, obtain its initial character vector by looking it up in the initial character vector table generated in Step 2.1.
Step 2.3: Using the GloVe word-vector model published by Stanford, generate an initial word vector table with the biomedical articles in the PubMed database as the corpus; the vector of each word has 300 dimensions.
Step 2.4: For every word in the training text and the test text, obtain its initial word vector by looking it up in the initial word vector table generated in Step 2.3.
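Steps 2.2 and 2.4 are plain table lookups. Below is a minimal sketch of the word-vector lookup; the zero-vector fallback for out-of-vocabulary words and the toy table are assumptions, since the patent does not specify how unseen words are handled:

```python
def lookup(words, word_table, dim=300):
    """Return the initial word vector for each token; unseen words fall
    back to a zero vector (assumed policy, not stated in the patent)."""
    zero = [0.0] * dim
    return [word_table.get(w.lower(), zero) for w in words]

# toy GloVe-style table standing in for the PubMed-trained vectors
word_table = {"colon": [0.1] * 300, "cancer": [0.2] * 300}
vecs = lookup(["Colon", "cancer", "craniotomy"], word_table)
print([v[0] for v in vecs])  # [0.1, 0.2, 0.0]
```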
Step 3: Build the neural-network medical entity extraction model that takes the character vectors and word vectors generated in Step 2 as joint input. The model is divided into three parts: an encoder, a decoder and a classifier. A CNN network and a Bi-LSTM network encode the character-vector and word-vector inputs respectively, a Bi-LSTM network decodes, and a softmax classifier completes the classification.
The detailed sub-steps are as follows:
Step 3.1: Using the initial character vectors generated in Step 2.2, splice together the initial character vectors of the characters composing each word into an initial character matrix and feed it into the convolutional neural network for encoding, one word at a time. Each initial character matrix input to the convolutional neural network first passes through a convolutional layer, in which a convolution kernel convolves the initial character vectors of adjacent characters within the word. The matrix output by the convolutional layer is then input to a max-pooling layer: for each row vector of the output matrix, max pooling selects the largest-valued dimension to represent the information of the entire row, so that the max-pooling layer outputs a single vector with the same dimension as the initial character vectors. For each word i, the resulting character vector is ci = {c1, c2, ..., cdim}. This step extracts the morphological features of each word. Temporal expressions composed of characters and digits, such as "2013-12-29", whose word sense and syntactic features are otherwise hard to extract, have their character composition captured well by this step.
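The character encoder of Step 3.1, a convolution over adjacent character vectors followed by max pooling back down to the character-vector dimension, can be sketched in pure Python. The window width of 3, the zero padding, and the fixed all-ones kernel are illustrative assumptions; in the model the kernel weights are learned:

```python
def char_cnn(char_vecs, width=3):
    """Convolve adjacent character vectors (window `width`, zero-padded,
    all-ones kernel per dimension for illustration), then take the
    per-dimension maximum over all positions (max pooling), so the output
    has the same dimension as one character vector."""
    dim = len(char_vecs[0])
    pad = [[0.0] * dim] * (width // 2)
    padded = pad + char_vecs + pad
    # one convolved vector per character position: sum over the window
    conv = [[sum(padded[p + k][d] for k in range(width)) for d in range(dim)]
            for p in range(len(char_vecs))]
    # max pooling: for each dimension keep the largest value across positions
    return [max(row[d] for row in conv) for d in range(dim)]

# toy word of 4 characters with 3-dimensional character vectors
word = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0], [0.0, 2.0, 0.0]]
out = char_cnn(word)
print(out)  # [4.0, 3.0, 3.0]
```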
Step 3.2: Using the initial word vectors generated in Step 2.3, splice together the initial word vectors of all words in each sentence and feed them into a Bi-LSTM for encoding. The bidirectional LSTM contains two LSTM layers, one forward and one backward. For the t-th word in a sentence, the forward LSTM yields the vector hft containing the context information from the first word to the t-th word, i.e. the hidden-layer output of the forward LSTM for the t-th word, and the backward LSTM yields the vector hbt containing the context information from the t-th word to the last word, i.e. the hidden-layer output of the backward LSTM for the t-th word. The two vectors are spliced together as the word's encoding ht = (hft, hbt). This step obtains the syntactic and lexical features of a word's context, so phrase-form temporal expressions composed of multiple words, such as "a few days", can be extracted well.
Step 3.3: Let the character vector of each word i output by the CNN layer be ci = {c1, c2, ..., cdim} and the word vector output by the Bi-LSTM encoding layer be hi = {wh1, wh2, ..., whn}. Both are normalized: let cmax be the largest-valued dimension of the character vector and whmax the largest-valued dimension of the word vector; the final character vector is then {c1/cmax, c2/cmax, ..., cdim/cmax} and the final word vector is {wh1/whmax, wh2/whmax, ..., whn/whmax}. The two are spliced into the final vector of each word, mi = {c1/cmax, c2/cmax, ..., cdim/cmax, wh1/whmax, wh2/whmax, ..., whn/whmax}. The final vectors of all words in each sentence are concatenated into a final vector matrix, which is then input, sentence by sentence, to a Bi-LSTM network for decoding.
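The normalization and splicing of Step 3.3 divide each vector by its largest component and concatenate the results; a direct sketch with toy dimensions (dim and n are far smaller than in the patent):

```python
def normalize_and_splice(char_vec, word_vec):
    """Divide each vector by its maximum component (cmax and whmax in the
    patent's notation), then splice the two into one final vector mi."""
    c_max = max(char_vec)
    wh_max = max(word_vec)
    return [c / c_max for c in char_vec] + [w / wh_max for w in word_vec]

m = normalize_and_splice([2.0, 4.0, 1.0], [5.0, 10.0])
print(m)  # [0.5, 1.0, 0.25, 0.5, 1.0]
```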
Step 3.4: Let the output vector of each word i after Bi-LSTM decoding be yi = {f1, f2, ..., ft}. Passing yi through the final softmax layer yields a three-dimensional vector {x1, x2, x3}, whose components represent the probabilities of classifying the word as B, I, and O respectively. The label corresponding to the largest-valued dimension is the word's final classification label. This step produces the final BIO label result for each word.
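The final classification of Step 3.4, a softmax over the three-dimensional decoder output followed by picking the most probable of B, I, and O, can be sketched as:

```python
import math

def classify_bio(scores):
    """Softmax over a 3-dimensional score vector {x1, x2, x3}, then pick
    the label (B, I, O) whose probability is largest."""
    exps = [math.exp(s - max(scores)) for s in scores]  # numerically stable
    total = sum(exps)
    probs = [e / total for e in exps]
    labels = ["B", "I", "O"]
    return labels[probs.index(max(probs))], probs

label, probs = classify_bio([2.0, 0.5, -1.0])
print(label)  # 'B'
```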
Step 4: Train the above model with the BIO-labeled training data; by comparing the actual BIO labels in the training data with the BIO labels output by the model, adjust the model parameters to optimize classification performance. The loss function used in the experiments is the log-likelihood loss.
Step 5: Test the model trained in Step 4 on the test data; the BIO label sequence finally output by the softmax classifier yields the extracted medical entities.
Embodiment 2: mainly used to extract event information from clinical medical text, e.g. craniotomy, bleed, cancer. To demonstrate the domain-adaptation ability of the present invention, the training text is the colon cancer text data in the THYME corpus and the test text is the brain cancer text data in the THYME corpus. Since the training text concerns colon cancer while the test text concerns brain cancer, the domain-adaptation problem must be solved. For example, "chemotherapy" is an event in both the source domain and the target domain, whereas "craniotomy" is an event only in the target domain.
Step 1: First perform word segmentation on the training text and the test text with the Stanford natural language processing toolkit, and annotate the segmented training text with BIO labels, where B indicates that the word begins an event sequence, I indicates that the word is inside an event sequence, and O indicates that the word does not belong to any information sequence.
Step 2: Build an initial character vector table for the 24 English letters and other common characters, and build initial word vectors using the biomedical articles in the PubMed database as the corpus. For the text segmented in Step 1, obtain the initial word vector of each word and the initial character vector of each character by table lookup.
The detailed sub-steps are as follows:
Step 2.1: Initialize a character vector for each of the English characters using random numbers. Specifically, each dimension of the initial vector is assigned a number generated at random within a fixed range that depends on dim (the exact bounds are given by a formula in the original filing), where dim is the dimension of the character vector; all initial character vectors are collected into an initial character vector table. Here the size of dim is set to 30. If x is the i-th character of the initial character vector table, the initial character vector of x is {cpi1, cpi2, ..., cpidim}.
Step 2.2: For every character in the training text and the test text, obtain its initial character vector by looking it up in the initial character vector table generated in Step 2.1.
Step 2.3: Using the GloVe word-vector model published by Stanford, generate an initial word vector table with the biomedical articles in the PubMed database as the corpus; the vector of each word has 300 dimensions.
Step 2.4: For every word in the training text and the test text, obtain its initial word vector by looking it up in the initial word vector table generated in Step 2.3.
Step 3: Build the neural-network medical entity extraction model that takes the character vectors and word vectors generated in Step 2 as joint input. The model is divided into three parts: an encoder, a decoder and a classifier. A CNN network and a Bi-LSTM network encode the character-vector and word-vector inputs respectively, a Bi-LSTM network decodes, and a softmax classifier completes the classification.
The detailed sub-steps are as follows:
Step 3.1: Using the initial character vectors generated in Step 2.2, splice together the initial character vectors of the characters composing each word into an initial character matrix and feed it into the convolutional neural network for encoding, one word at a time. Each initial character matrix input to the convolutional neural network first passes through a convolutional layer, in which a 3×3 convolution kernel convolves the initial character vectors of adjacent characters within the word. The matrix output by the convolutional layer is then input to a max-pooling layer: for each row vector of the output matrix, max pooling selects the largest-valued dimension to represent the information of the entire row, so that the max-pooling layer outputs a single vector with the same dimension as the initial character vectors. For each word i, the resulting character vector is ci = {c1, c2, ..., cdim}. This step extracts the morphological features of each word. Medical text often contains many rare specialized terms, but they follow certain word-formation rules; for example, words with the prefix ant- or anti- carry the meaning of opposing, counteracting, or inhibiting, such as antacid, antibiotic, and anticoagulant. This step therefore extracts such information more fully.
Step 3.2: Using the initial word vectors generated in Step 2.3, splice together the initial word vectors of all words in each sentence and feed them into a Bi-LSTM for encoding. The bidirectional LSTM contains two LSTM layers, one forward and one backward. For the t-th word in a sentence, the forward LSTM yields the vector hft containing the context information from the first word to the t-th word, i.e. the hidden-layer output of the forward LSTM for the t-th word, and the backward LSTM yields the vector hbt containing the context information from the t-th word to the last word, i.e. the hidden-layer output of the backward LSTM for the t-th word. The two vectors are spliced together as the word's encoding ht = (hft, hbt). This step obtains the syntactic and lexical features of a word's context.
Step 3.3: Let the character vector of each word i output by the CNN layer be ci = {c1, c2, ..., cdim} and the word vector output by the Bi-LSTM encoding layer be hi = {wh1, wh2, ..., whn}. Both are normalized: let cmax be the largest-valued dimension of the character vector and whmax the largest-valued dimension of the word vector; the final character vector is then {c1/cmax, c2/cmax, ..., cdim/cmax} and the final word vector is {wh1/whmax, wh2/whmax, ..., whn/whmax}. The two are spliced into the final vector of each word, mi = {c1/cmax, c2/cmax, ..., cdim/cmax, wh1/whmax, wh2/whmax, ..., whn/whmax}. The final vectors of all words in each sentence are concatenated into a final vector matrix, which is then input, sentence by sentence, to a Bi-LSTM network for decoding. By using the CNN and the Bi-LSTM to select features automatically, the process of manual feature design is avoided, which solves the domain-adaptation problem.
Step 3.4: Let the output vector of each word i after Bi-LSTM decoding be yi = {f1, f2, ..., ft}. Passing yi through the final softmax layer yields a three-dimensional vector {x1, x2, x3}, whose components represent the probabilities of classifying the word as B, I, and O respectively. The label corresponding to the largest-valued dimension is the word's final classification label. This step produces the final BIO label result for each word.
Step 4: Train the above model with the BIO-labeled training data; by comparing the actual BIO labels in the training data with the BIO labels output by the model, adjust the model parameters to optimize classification performance. The loss function used in the experiments is the log-likelihood loss.
Step 5: Test the model trained in Step 4 on the test data; the BIO label sequence finally output by the softmax classifier yields the extracted medical entities.

Claims (3)

1. A clinical medical information extraction method based on neural networks, characterized by comprising the following steps:
Step 1: first perform word segmentation on the training text and the test text, and annotate the segmented training text with BIO labels;
Step 2: build an initial character vector table for the 24 English letters and other common characters, and build initial word vectors using the biomedical articles in the PubMed database as the corpus; for the text segmented in Step 1, obtain the initial word vector of each word and the initial character vector of each character by table lookup;
Step 3: build a neural-network medical entity extraction model that takes the character vectors and word vectors generated in Step 2 as joint input; the model is divided into three parts, an encoder, a decoder and a classifier; a CNN network and a Bi-LSTM network encode the character-vector and word-vector inputs respectively, a Bi-LSTM network decodes, and a softmax classifier completes the classification;
Step 4: train the above model with the BIO-labeled training data; by comparing the actual BIO labels in the training data with the BIO labels output by the model, adjust the model parameters to optimize classification performance;
Step 5: test the model trained in Step 4 on the test data; the BIO label sequence finally output by the softmax classifier yields the extracted medical entities.
2. The clinical medical information extraction method based on neural networks according to claim 1, characterized in that Step 2 comprises the following sub-steps:
Step 2.1: initialize a character vector for each of the English characters using random numbers; specifically, each dimension of the initial vector is assigned a number generated at random within a fixed range that depends on dim (the exact bounds are given by a formula in the original filing), where dim is the dimension of the character vector; all initial character vectors are collected into an initial character vector table, and the size of dim is between 30 and 50;
Step 2.2: for every character in the training text and the test text, obtain its initial character vector by looking it up in the initial character vector table generated in Step 2.1;
Step 2.3: using the GloVe word-vector model published by Stanford, generate an initial word vector table with the biomedical articles in the PubMed database as the corpus;
Step 2.4: for every word in the training text and the test text, obtain its initial word vector by looking it up in the initial word vector table generated in Step 2.3.
A kind of 3. clinical medicine information extracting method based on neural network according to claim 1, which is characterized in that institute Step 3 is stated, is included the following steps:
Step 3.1:The original character vector generated using step 2.2, will form the character of each word its corresponding initial word The symbol vector generation original character matrix that is stitched together is sent into convolutional neural networks and is encoded (as unit of each word), for Each is input to the original character matrix of convolutional neural networks, first passes around a convolutional layer, will be formed using convolution kernel every The original character vector of a word adjacent character carries out convolution, then by the Input matrix of convolutional layer output to a maximum pond Layer is directed to each row vector of convolutional layer output matrix, using maximum pond layer choosing access value maximum that it is one-dimensional represent it is whole The information that a row vector includes then after maximum pond layer, exports a vector identical with original character vector dimension;
Step 3.2:The initial term vector generated using step 2.3, by the corresponding initial term vector of words all in each sentence Be stitched together to be fed through in a Bi-LSTM and be encoded, wherein included in two-way LSTM there are two LSTM layers, one be it is preceding to LSTM, one is backward LSTM, then t-th of word being directed in a sentence, is obtained using preceding to LSTM comprising first Word to t-th word context information corresponding vector hft, obtained using backward LSTM comprising t-th of word to last one The corresponding vector h of word context informationbt, vector is stitched together, the term vector h as t-th of wordt=(hft, hbt);
Step 3.3: Let the character vector of each word i output by the CNN layer be {c1, c2, ..., cdim}, and let the word vector of each word i output by the Bi-LSTM encoding layer be {wh1, wh2, ..., whn}. Both are then normalized: let cmax be the largest-valued dimension of the character vector and whmax be the largest-valued dimension of the word vector; the final character vector is then {c1/cmax, c2/cmax, ..., cdim/cmax} and the final word vector is {wh1/whmax, wh2/whmax, ..., whn/whmax}. The above two vectors are spliced to obtain the final vector of each word, mi = {c1/cmax, c2/cmax, ..., cdim/cmax, wh1/whmax, wh2/whmax, ..., whn/whmax}; the final vectors of all words in each sentence are cascaded to form a final vector matrix, which is then input, sentence by sentence, to a Bi-LSTM network for decoding;
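The normalization and splice of Step 3.3 reduce to dividing each vector by its largest component and concatenating the results. A minimal sketch with made-up numbers:

```python
# Sketch of Step 3.3: max-normalize the CNN character vector and the
# Bi-LSTM word vector of one word, then splice them into m_i.
char_vec = [0.5, 2.0, 1.0]   # CNN output for word i (made up)
word_vec = [3.0, 1.5, 6.0]   # Bi-LSTM encoder output for word i (made up)

def max_normalize(vec):
    """Scale so the largest component becomes 1 (divide by c_max / wh_max)."""
    m = max(vec)
    return [v / m for v in vec]

final_vec = max_normalize(char_vec) + max_normalize(word_vec)  # m_i
```

After normalization, the largest entry of each half is exactly 1, so neither the character half nor the word half dominates the spliced vector by scale alone.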
Step 3.4: Pass the decoded output vectors of the Bi-LSTM through a final softmax layer to obtain the final BIO tagging result for each word.
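Step 3.4 can be sketched as a softmax over per-word tag scores followed by an argmax. The score rows below are invented; in the claimed method they would come from the decoding Bi-LSTM:

```python
import math

# Sketch of Step 3.4: softmax over per-word tag scores, then argmax,
# yields one BIO label per word. Scores are made up for illustration.
TAGS = ["B", "I", "O"]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bio_tag(score_rows):
    """Pick the highest-probability BIO tag for each word."""
    tags = []
    for scores in score_rows:
        probs = softmax(scores)
        tags.append(TAGS[probs.index(max(probs))])
    return tags

# One invented score row per word of "chronic kidney disease was noted".
scores = [[2.0, 0.1, 0.3], [0.2, 1.8, 0.1], [0.1, 2.2, 0.4],
          [0.1, 0.2, 1.9], [0.3, 0.1, 2.5]]
labels = bio_tag(scores)
```

In this toy run the first three words are tagged B, I, I (one multi-word entity) and the rest O, which is exactly the BIO convention the claim refers to.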
CN201711462492.6A 2017-12-28 2017-12-28 A kind of clinical medicine information extracting method based on neural network Pending CN108182976A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711462492.6A CN108182976A (en) 2017-12-28 2017-12-28 A kind of clinical medicine information extracting method based on neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711462492.6A CN108182976A (en) 2017-12-28 2017-12-28 A kind of clinical medicine information extracting method based on neural network

Publications (1)

Publication Number Publication Date
CN108182976A true CN108182976A (en) 2018-06-19

Family

ID=62548515

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711462492.6A Pending CN108182976A (en) 2017-12-28 2017-12-28 A kind of clinical medicine information extracting method based on neural network

Country Status (1)

Country Link
CN (1) CN108182976A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105955952A (en) * 2016-05-03 2016-09-21 成都数联铭品科技有限公司 Information extraction method based on bi-directional recurrent neural network
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN106682220A (en) * 2017-01-04 2017-05-17 华南理工大学 Online traditional Chinese medicine text named entity identifying method based on deep learning
CN107330032A (en) * 2017-06-26 2017-11-07 北京理工大学 A kind of implicit chapter relationship analysis method based on recurrent neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XUEZHE MA et al.: "End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF", http://arxiv.org/pdf/1603.01354.pdf *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109472194B (en) * 2018-09-26 2022-02-11 重庆邮电大学 Motor imagery electroencephalogram signal feature identification method based on CBLSTM algorithm model
CN109472194A (en) * 2018-09-26 2019-03-15 重庆邮电大学 A kind of Mental imagery EEG signals characteristic recognition method based on CBLSTM algorithm model
CN109543183A (en) * 2018-11-16 2019-03-29 西安交通大学 Multi-tag entity-relation combined extraction method based on deep neural network and mark strategy
CN109800411B (en) * 2018-12-03 2023-07-18 哈尔滨工业大学(深圳) Clinical medical entity and attribute extraction method thereof
CN109800411A (en) * 2018-12-03 2019-05-24 哈尔滨工业大学(深圳) Clinical treatment entity and its attribute extraction method
CN109815478A (en) * 2018-12-11 2019-05-28 北京大学 Medicine entity recognition method and system based on convolutional neural networks
CN111435410A (en) * 2019-01-14 2020-07-21 阿里巴巴集团控股有限公司 Relationship extraction method and device for medical texts
CN111435410B (en) * 2019-01-14 2023-04-14 阿里巴巴集团控股有限公司 Relationship extraction method and device for medical texts
CN110083838A (en) * 2019-04-29 2019-08-02 西安交通大学 Biomedical relation extraction method based on multilayer neural network Yu external knowledge library
CN110222338A (en) * 2019-05-28 2019-09-10 浙江邦盛科技有限公司 A kind of mechanism name entity recognition method
CN110222201A (en) * 2019-06-26 2019-09-10 中国医学科学院医学信息研究所 A kind of disease that calls for specialized treatment knowledge mapping construction method and device
CN110222201B (en) * 2019-06-26 2021-04-27 中国医学科学院医学信息研究所 Method and device for constructing special disease knowledge graph
CN110867225A (en) * 2019-11-04 2020-03-06 山东师范大学 Character-level clinical concept extraction named entity recognition method and system
CN111553012A (en) * 2020-04-28 2020-08-18 广东博智林机器人有限公司 Home decoration design method and device, electronic equipment and storage medium
CN111476022B (en) * 2020-05-15 2023-07-07 湖南工商大学 Character embedding and mixed LSTM entity identification method, system and medium for entity characteristics
CN111476022A (en) * 2020-05-15 2020-07-31 湖南工商大学 Method, system and medium for recognizing STM entity by embedding and mixing L characters of entity characteristics
CN112765319A (en) * 2021-01-20 2021-05-07 中国电子信息产业集团有限公司第六研究所 Text processing method and device, electronic equipment and storage medium
CN112765319B (en) * 2021-01-20 2021-09-03 中国电子信息产业集团有限公司第六研究所 Text processing method and device, electronic equipment and storage medium
CN112687328A (en) * 2021-03-12 2021-04-20 北京贝瑞和康生物技术有限公司 Method, apparatus and medium for determining phenotypic information of clinical descriptive information
CN112687332A (en) * 2021-03-12 2021-04-20 北京贝瑞和康生物技术有限公司 Method, apparatus and storage medium for determining sites of variation at risk of disease
CN113033155A (en) * 2021-05-31 2021-06-25 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Automatic coding method for medical concepts by combining sequence generation and hierarchical word lists
CN113297852A (en) * 2021-07-26 2021-08-24 北京惠每云科技有限公司 Medical entity word recognition method and device
CN113297852B (en) * 2021-07-26 2021-11-12 北京惠每云科技有限公司 Medical entity word recognition method and device

Similar Documents

Publication Publication Date Title
CN108182976A (en) A kind of clinical medicine information extracting method based on neural network
CN105808525B (en) A kind of field concept hyponymy abstracting method based on similar concept pair
CN104268160B (en) A kind of OpinionTargetsExtraction Identification method based on domain lexicon and semantic role
CN108334495A (en) Short text similarity calculating method and system
CN108846017A (en) The end-to-end classification method of extensive newsletter archive based on Bi-GRU and word vector
CN106096664B (en) A kind of sentiment analysis method based on social network data
CN108108355A (en) Text emotion analysis method and system based on deep learning
CN108491372B (en) Chinese word segmentation method based on seq2seq model
CN107122349A (en) A kind of feature word of text extracting method based on word2vec LDA models
CN108804423B (en) Medical text feature extraction and automatic matching method and system
CN103020034A (en) Chinese words segmentation method and device
CN111243699A (en) Chinese electronic medical record entity extraction method based on word information fusion
CN108509409A (en) A method of automatically generating semantic similarity sentence sample
CN112966525B (en) Law field event extraction method based on pre-training model and convolutional neural network algorithm
CN105446955A (en) Adaptive word segmentation method
CN108710611A (en) A kind of short text topic model generation method of word-based network and term vector
CN107357785A (en) Theme feature word abstracting method and system, feeling polarities determination methods and system
CN110717330A (en) Word-sentence level short text classification method based on deep learning
CN111914555B (en) Automatic relation extraction system based on Transformer structure
CN106528776A (en) Text classification method and device
CN108427717A (en) It is a kind of based on the alphabetic class family of languages medical treatment text Relation extraction method gradually extended
CN110222338A (en) A kind of mechanism name entity recognition method
CN108763192A (en) Entity relation extraction method and device for text-processing
CN105159917A (en) Generalization method for converting unstructured information of electronic medical record to structured information
CN115510864A (en) Chinese crop disease and pest named entity recognition method fused with domain dictionary

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180619