CN108170675A - A deep-learning-based named entity recognition method for the medical field - Google Patents

A deep-learning-based named entity recognition method for the medical field

Info

Publication number
CN108170675A
Authority
CN
China
Prior art keywords
parameter
lstm
hidden layer
value
word
Prior art date
Legal status
Pending
Application number
CN201711446980.8A
Other languages
Chinese (zh)
Inventor
朱聪慧
赵铁军
关毅
李岳
Current Assignee
Harbin Fuman Science And Technology Co Ltd
Original Assignee
Harbin Fuman Science And Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Harbin Fuman Science And Technology Co Ltd
Priority to CN201711446980.8A
Publication of CN108170675A


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295: Named entity recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Abstract

The present invention proposes a deep-learning-based named entity recognition method for the medical field. The method comprises: (1) training a long short-term memory network LSTM on the training portion of an annotated medical-field corpus; (2) performing, with the neural network parameters θ updated in (1), a path search over candidate labelings to obtain the annotation results of the annotated corpus, and assessing the annotation results on the test portion of the annotated corpus with the named entity recognition evaluation F value; (3) in the training of (1), first training the long short-term memory network LSTM on an annotated news-domain corpus, then training the medical-field model from the resulting model together with the annotated medical-field corpus, again assessing the annotation results on the test portion of the annotated corpus with the named entity recognition evaluation F value. The present invention applies to the field of named entity recognition.

Description

A deep-learning-based named entity recognition method for the medical field
Technical field
The present invention relates to named entity recognition methods, and more particularly to a deep-learning-based named entity recognition method for the medical field.
Background technology
Named entity recognition, one of the basic tasks of information extraction, has important applications in question answering systems, syntactic analysis, machine translation, and other fields. Medical entities differ considerably from common entities, so open-domain entity-annotated corpus information contributes little to the annotation of medical entities; at the same time, named entity recognition in the medical field lacks annotated corpora, mainly because judging medical entities requires professionals, which greatly increases the cost of medical entity annotation. How to label well in the medical field using only a small amount of annotated data is therefore highly important.
Deep learning has made major progress in recent years and has been shown to discover and learn the complex structure present in high-dimensional data. In natural language processing, a new word representation method, the word vector (word embedding), has achieved immense success.
The word vector (word embedding) is a word representation that has in recent years commonly replaced the traditional bag-of-words representation, solving the dimensionality disaster that bag-of-words representations bring. Researchers have also found that word vectors obtained by training language models contain lexical semantic information, and that data such as lexical similarity can to some extent be computed from them by simple algorithms. Furthermore, since training word vectors requires no annotation work, much of the workload around word-vector learning is avoided, and training can be arranged as needed: large open corpora can be used to obtain good general-purpose word vectors, corpora from a single field can be selected to obtain word vectors specific to that field, or training can be carried out directly for a given task.
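For illustration, this kind of unannotated pre-training can be sketched as follows; the gensim toolkit, the file name, and the whitespace tokenization are assumptions of the example, not part of the method:

```python
# Minimal sketch of unsupervised word-vector pre-training on a raw,
# unannotated corpus; gensim and the input format are illustrative choices.
from gensim.models import Word2Vec

# One tokenized sentence per line; character-level tokens also work,
# matching the by-character processing recommended later in this document.
with open("medical_raw.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
voc = model.wv.index_to_key   # vocabulary voc = [voc_1, ..., voc_n]
vec = model.wv.vectors        # word vectors vec = [vec_1, ..., vec_n]
```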
Word vectors are generally trained with deep neural networks, and in natural language processing the recurrent neural network (RNN) is one of the most widely used network models. In natural language processing the influence of preceding context on what follows is usually captured with a language model; the RNN exploits preceding information naturally through a recurrent feedback hidden layer and can in theory use the entire preceding context, which conventional language models cannot do. In practice, however, RNN models suffer from the vanishing gradient problem, and the long short-term memory unit (Long Short-Term Memory, LSTM) is an effective improvement of the RNN. The LSTM addresses the RNN's inability to retain needed information effectively: it records information in a memory cell and introduces several gates that control how the memory cell is updated and used, so that the required information is preserved effectively. LSTMs are now widely used in natural language processing tasks ranging from word segmentation, part-of-speech tagging, and named entity recognition to machine translation.
Pre-training is a common technique for deep neural networks. Several research results have shown that initializing the parameters of a neural network with word vectors obtained by unsupervised training on large corpora yields better models than training from random initialization, mainly because pre-trained word vectors can exploit large unannotated corpora containing information absent from the training data, and can to some extent keep randomly initialized word vectors from falling into local extrema during optimization. For the medical field, where training data are rare, supplementing training with large unannotated corpora is therefore very meaningful.
The models currently used for the named entity recognition task fall into two classes: conventional models, with the CRF as their representative, and deep neural network models; in the medical field the traditional CRF model is still generally used.
Because CRF models do not consider semantic information, when training corpora are extremely scarce their annotation results contain many meaningless labelings, whereas the semantic information contained in LSTM models can prevent this from happening.
Summary of the invention
The purpose of the present invention is to solve the problem that CRF models, which do not consider semantic information, produce many meaningless annotation results when training corpora are extremely scarce; to this end, with the help of a large news-domain corpus, a deep-learning-based named entity recognition method for the medical field is proposed.
The above object of the invention is achieved through the following technical solution:
A deep-learning-based named entity recognition method for the medical field, characterized in that the specific steps of the method are as follows:
Step 1: train word vectors $vec_i$ on the unannotated medical corpus, obtaining the vocabulary $voc$ of the supplementary medical-field corpus and the word vectors $vec$ corresponding to $voc$; $vec = [vec_1, vec_2, \ldots, vec_n]$; $voc = [voc_1, voc_2, \ldots, voc_n]$; where $i = 1, 2, \ldots, n$ and $n$ is the total number of distinct words in the unannotated corpus;
Step 2: train the long short-term memory network LSTM on the training portion of the annotated news-domain corpus. The word vectors $vec$ of step 1 serve as the pre-training vectors for the training of the LSTM; with the LSTM method, the optimization objective is computed from the pre-training vectors and $x_k$, $y_k$, and gradient descent on this objective updates the LSTM parameters $\theta_C$. The annotated corpus comprises a training portion and a test portion. The training finally yields the converged LSTM parameters, i.e., the values of the model parameters $\theta_C$ at the final, $n$-th iteration, specifically: $W_{x,in}$, $W_{h,in}$, $W_{c,in}$, $W_{x,o}$, $W_{h,o}$, $W_{c,o}$, $W_{x,f}$, $W_{h,f}$, $W_{c,f}$, $b_{in}$, $b_o$, and $b_f$, where $W_{x,in}$ is the input weight of the hidden-layer input gate; $W_{h,in}$ the state-input weight of the hidden-layer input gate; $W_{c,in}$ the memory-cell input weight of the hidden layer; $W_{x,o}$ the input weight of the hidden-layer output gate; $W_{h,o}$ the state-input weight of the hidden-layer output gate; $W_{c,o}$ the memory-cell output weight of the hidden layer; $W_{x,f}$ the input weight of the hidden-layer forget gate; $W_{h,f}$ the state-input weight of the hidden-layer forget gate; $W_{c,f}$ the memory-cell input weight of the hidden-layer forget gate; $b_{in}$ the hidden-layer input-gate bias; $b_o$ the hidden-layer output-gate bias; and $b_f$ the hidden-layer forget-gate bias;
where $x_k$ is the word sequence of LSTM inputs corresponding to the training portion of the annotated corpus for the $k$-th sample, and $y_k$ is the annotation-result vector corresponding to the training portion of the annotated corpus for the $k$-th sample;
Step 3: train the long short-term memory network LSTM on the training portion of the annotated medical-domain corpus. The word vectors $vec$ obtained in step 1 serve as the pre-training vectors for the training of the LSTM; with the LSTM method, the optimization objective is computed from the pre-training vectors and $x_k$, $y_k$, and gradient descent on this objective updates the LSTM parameters $\theta$. The annotated corpus comprises a training portion and a test portion;
where $x_k$ is the word sequence of LSTM inputs corresponding to the training portion of the annotated corpus for the $k$-th sample, and $y_k$ is the annotation-result vector corresponding to the training portion of the annotated corpus for the $k$-th sample;
Step 4: test the parameter-updated LSTM. The test process is: input the annotated corpora of steps 2 and 3, carry out the path search over labelings with the neural network parameters $\theta_C$ updated in step 2, and obtain the annotation results of the annotated corpus; assess the annotation results on the test portion of the annotated corpus with the named entity recognition evaluation F value, obtaining annotation results that meet expectations after assessment. The specific assessment is computed as follows:
precision = number of correctly labeled entity words / total number of labeled entity words
recall = number of correctly labeled entity words / total number of entity words
F value = 2 × precision × recall / (precision + recall)
Step 5: repeat steps 2 to 4 on the annotated corpora until the named entity recognition evaluation F value of step 4 no longer increases or the number of repetitions of steps 2 to 4 reaches a maximum of 50 to 100.
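As an illustration of the assessment in step 4, treating annotation results as sets of (mention, type) pairs; the toy entity sets below are invented for the example:

```python
# Sketch of the F-value assessment of step 4 over sets of (mention, type)
# pairs; the entity sets below are toy data for illustration only.
def f_value(predicted, gold):
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

predicted = {("阿司匹林", "DRUG"), ("头痛", "SYMPTOM"), ("病历", "DISEASE")}
gold = {("阿司匹林", "DRUG"), ("头痛", "SYMPTOM"),
        ("糖尿病", "DISEASE"), ("发热", "SYMPTOM")}
p, r, f = f_value(predicted, gold)   # p = 2/3, r = 1/2, F ≈ 0.571
```

The loop of step 5 then repeats steps 2 to 4, keeping the best F value and stopping once it no longer increases or the repetition limit is reached.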
Further, the update of the LSTM parameters $\theta_C$ in step 2 proceeds as follows:
Step 2-1: pre-train the vocabulary $voc$ and the word vectors $vec$ corresponding to $voc$; using $x_k$ and the word vectors $vec$ obtained in step 1, compute the input sequence $X$ of the LSTM network, where $X = X_1, X_2, \ldots, X_t, \ldots, X_T$;
Step 2-2: using the input $X_t$, the hidden layer $h_{t-1}$ computed at step $t-1$, and the memory cell $c_{t-1}$ computed at step $t-1$, compute the input gate $in_t$ of the LSTM model at step $t$, the output gate $o_t$ of the LSTM model, and the forget gate $f_t$ of the LSTM model; from $in_t$, $o_t$, and $f_t$ compute the memory cell value $c_t$ and the hidden layer value $h_t$, where the specific model of the hidden layer value $h_t$ is $h_t = o_t \circ \tanh(c_t)$;
Step 2-3: input the elements of the sequence $X = X_1, X_2, \ldots, X_t, \ldots, X_T$ of step 2-1 in order from $X_1$ to $X_T$ into the model of the hidden layer value $h_t$ of step 2-2 to obtain the forward hidden-layer output $h_f$; then input the elements of the same sequence in order from $X_T$ to $X_1$ to obtain the backward hidden-layer output $h_b$;
Step 2-4: compute the sequence cost of the hidden-layer results $h_f$ and $h_b$ obtained in step 2-3 with the transition-cost method over the entire sequence, obtaining the optimization objective; optimize it with gradient descent to update the LSTM parameters $\theta_C$, where $\theta_C$ comprises word_emb, $W_{x,in}$, $W_{h,in}$, $W_{c,in}$, $W_{x,o}$, $W_{h,o}$, $W_{c,o}$, $W_{x,f}$, $W_{h,f}$, $W_{c,f}$, $b_{in}$, $b_o$, and $b_f$, and word_emb is the pre-trained word-vector weight parameter.
Further, the update of the LSTM parameters $\theta$ in step 3 proceeds as follows:
Step 3-1: pre-train the vocabulary $voc$ and the word vectors $vec$ corresponding to $voc$; using $x_k$ and the word vectors $vec$ obtained in step 1, compute the input sequence $X$ of the LSTM network, where $X = X_1, X_2, \ldots, X_t, \ldots, X_T$;
Step 3-2: load the model parameters $\theta_n$ obtained by training the news-domain LSTM; on the basis of the parameters $\theta_n$, using the input $X_t$, the hidden layer $h_{t-1}$ computed at step $t-1$, and the memory cell $c_{t-1}$ computed at step $t-1$, compute the input gate $in_t$ of the LSTM model at step $t$, the output gate $o_t$ of the LSTM model, and the forget gate $f_t$ of the LSTM model; from $in_t$, $o_t$, and $f_t$ compute the memory cell value $c_t$ and the hidden layer value $h_t$, where the specific model of the hidden layer value $h_t$ is $h_t = o_t \circ \tanh(c_t)$;
Step 3-3: input the elements of the sequence $X = X_1, X_2, \ldots, X_t, \ldots, X_T$ of step 3-1 in order from $X_1$ to $X_T$ into the model of the hidden layer value $h_t$ of step 3-2 to obtain the forward hidden-layer output $h_f$; then input the elements of the same sequence in order from $X_T$ to $X_1$ to obtain the backward hidden-layer output $h_b$;
Step 3-4: compute the sequence cost of the hidden-layer results $h_f$ and $h_b$ obtained in step 3-3 with the transition-cost method over the entire sequence, obtaining the optimization objective; optimize it with gradient descent to update the LSTM parameters $\theta$, where $\theta$ comprises word_emb, $W_{x,in}$, $W_{h,in}$, $W_{c,in}$, $W_{x,o}$, $W_{h,o}$, $W_{c,o}$, $W_{x,f}$, $W_{h,f}$, $W_{c,f}$, $b_{in}$, $b_o$, and $b_f$.
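The loading in step 3-2 amounts to starting the medical-domain gradient descent from the converged news-domain parameters $\theta_n$ rather than from random values. A minimal sketch, with illustrative shapes and a name-to-array dictionary layout that is an assumption of the example:

```python
import numpy as np

# Illustrative shapes: embedding dimension d, hidden size h.
d, h = 100, 128
rng = np.random.default_rng(0)

# theta_n: converged news-domain parameters from step 2 (random stand-ins here).
names = ["W_x_in", "W_h_in", "W_c_in", "W_x_o", "W_h_o", "W_c_o",
         "W_x_f", "W_h_f", "W_c_f"]
theta_news = {n: rng.normal(0.0, 0.1, (h, d) if n.startswith("W_x") else (h, h))
              for n in names}
theta_news.update({b: np.zeros(h) for b in ["b_in", "b_o", "b_f"]})

# Step 3-2: initialize the medical-domain model with theta_n and continue
# gradient descent on the annotated medical corpus from these values.
theta_medical = {name: value.copy() for name, value in theta_news.items()}
```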
Further, the specific procedure for obtaining the input $X$ of the LSTM network in step 2-1 and step 3-1 is:
Build the vocabulary $voc'$ of the training portion of the annotated corpus, and merge $voc'$ and $voc$ into the vocabulary $VOC$; $VOC = VOC_1, VOC_2, VOC_3, \ldots, VOC_N$;
Randomly initialize the vector matrix word_emb corresponding to the vocabulary $VOC$, such that the dimension of word_emb is identical to that of the word vectors $vec$, and carry out the assignment of formula (1):
where word_emb$_i$ is the $i$-th word vector in word_emb;
Finally, multiply $x_{k[k1,k2]}$ by word_emb to obtain the input $X$ of the LSTM network:
$X = x_{k[k1,k2]} \cdot \mathrm{word\_emb}$ (2)
where $x_{k[k1,k2]}$ is the subsequence of the word sequence $x_k$ between positions $k1$ and $k2$.
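A sketch of this construction under one reading of formula (1), whose image is not reproduced above: rows of word_emb whose word also occurs in $voc$ receive the pre-trained vector, and all other rows keep their random initialization. The toy vocabulary and dimensions are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4                                      # toy embedding dimension
voc = ["头痛", "发热", "阿司匹林"]             # pre-trained vocabulary from step 1
vec = rng.normal(0.0, 0.1, (len(voc), dim))  # pre-trained vectors
VOC = ["头痛", "发热", "阿司匹林", "病历"]     # merged vocabulary voc' + voc

# Random initialization of word_emb, then the assignment of formula (1):
# rows whose word exists in voc are overwritten with the pre-trained vector.
word_emb = rng.normal(0.0, 0.1, (len(VOC), dim))
index = {w: j for j, w in enumerate(voc)}
for i, w in enumerate(VOC):
    if w in index:
        word_emb[i] = vec[index[w]]

# Formula (2): with x_k[k1,k2] as one-hot rows, the product is a row lookup.
x_k = np.array([0, 3, 1])   # word indices of one input sentence
X = word_emb[x_k]           # LSTM input sequence X_1 ... X_T
```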
Further, a second specific procedure for obtaining the input $X$ of the LSTM network in step 2-1 and step 3-1 is:
Randomly initialize the vector matrix word_emb corresponding to the vocabulary $VOC$ and, after carrying out the assignment of formula (1), keep the vectors word_emb$_i$ unchanged, i.e., they are not updated as parameters;
then randomly initialize a further vector matrix word_emb_para corresponding to the vocabulary $VOC$, and compute the input $X$ of the LSTM network according to the model of formula (3):
$X = (x_{k[k1,k2]} \cdot \mathrm{word\_emb}) \oplus (x_{k[k1,k2]} \cdot \mathrm{word\_emb\_para})$ (3).
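A minimal sketch of this second variant, reading $\oplus$ in formula (3) as vector concatenation (an interpretation); it reuses word_emb, x_k, np, and rng from the previous sketch, keeping word_emb frozen while word_emb_para is trained:

```python
# Method 2 (formula (3)): concatenate the frozen pre-trained lookup with a
# trainable, randomly initialized lookup of the same shape. Only
# word_emb_para is updated by gradient descent; word_emb stays fixed.
word_emb_para = rng.normal(0.0, 0.1, word_emb.shape)
X = np.concatenate([word_emb[x_k], word_emb_para[x_k]], axis=1)  # (T, 2*dim)
```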
Further, the input gate $in_t$ of the LSTM model computed at step $t$ in step 2-2 and step 3-2 is obtained according to model (4), which is as follows:
$in_t = \sigma(W_{x,in} X_t + W_{h,in} h_{t-1} + W_{c,in} c_{t-1} + b_{in})$ (4)
where $\sigma$ is the sigmoid function; $W_{x,in}$ is the input-gate parameter matrix multiplied with $X_t$; $W_{h,in}$ the input-gate parameter matrix multiplied with $h_{t-1}$; $W_{c,in}$ the input-gate parameter matrix multiplied with $c_{t-1}$; and $b_{in}$ the bias for computing the input gate.
Further, the output gate $o_t$ of the LSTM model computed at step $t$ in step 2-2 and step 3-2 is obtained according to model (5), which is as follows:
$o_t = \sigma(W_{x,o} X_t + W_{h,o} h_{t-1} + W_{c,o} c_{t-1} + b_o)$ (5)
where $W_{x,o}$ is the output-gate parameter matrix multiplied with $X_t$; $W_{h,o}$ the output-gate parameter matrix multiplied with $h_{t-1}$; $W_{c,o}$ the output-gate parameter matrix multiplied with $c_{t-1}$; and $b_o$ the bias for computing the output gate.
Further, the forget gate $f_t$ of the LSTM model computed at step $t$ in step 2-2 and step 3-2 is obtained according to model (6), which is as follows:
$f_t = \sigma(W_{x,f} X_t + W_{h,f} h_{t-1} + W_{c,f} c_{t-1} + b_f)$ (6)
where $W_{x,f}$ is the forget-gate parameter matrix multiplied with $X_t$; $W_{h,f}$ the forget-gate parameter matrix multiplied with $h_{t-1}$; $W_{c,f}$ the forget-gate parameter matrix multiplied with $c_{t-1}$; and $b_f$ the bias for computing the forget gate.
Further, the specific procedure in step 2-2 and step 3-2 for computing the memory cell value $c_t$ and the hidden layer value $h_t$ from $in_t$, $o_t$, and $f_t$ is:
Step 1: first compute the ungated memory cell value at step $t$, $\tilde{c}_t = \tanh(W_{x,c} X_t + W_{h,c} h_{t-1} + b_c)$ (7);
where $W_{x,c}$ is the memory-cell parameter matrix multiplied with $X_t$; $W_{h,c}$ the memory-cell parameter matrix multiplied with $h_{t-1}$; and $b_c$ the memory-cell bias;
Step 2: from the input gate value $in_t$ and the forget gate value $f_t$ computed according to models (4) and (6), the ungated memory cell value $\tilde{c}_t$, and $c_{t-1}$, compute the memory cell value at step $t$, $c_t = f_t \circ c_{t-1} + in_t \circ \tilde{c}_t$ (8);
Finally, from the memory cell value $c_t$ and the output gate $o_t$ computed by formula (5), compute the hidden layer value $h_t$, whose specific model is as follows:
$h_t = o_t \circ \tanh(c_t)$ (9).
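Formulas (4) to (9) together define one step of a peephole-style LSTM cell, and steps 2-3 and 3-3 run that cell over the sequence in both directions. The following is a minimal sketch; the parameter dictionary layout and shapes are assumptions of the example, and formula (8) uses the standard LSTM combination assumed above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(X_t, h_prev, c_prev, p):
    """One step of the cell of formulas (4)-(9); p maps the parameter
    names used in the text to numpy arrays."""
    in_t = sigmoid(p["W_x_in"] @ X_t + p["W_h_in"] @ h_prev
                   + p["W_c_in"] @ c_prev + p["b_in"])        # formula (4)
    o_t = sigmoid(p["W_x_o"] @ X_t + p["W_h_o"] @ h_prev
                  + p["W_c_o"] @ c_prev + p["b_o"])           # formula (5)
    f_t = sigmoid(p["W_x_f"] @ X_t + p["W_h_f"] @ h_prev
                  + p["W_c_f"] @ c_prev + p["b_f"])           # formula (6)
    c_tilde = np.tanh(p["W_x_c"] @ X_t + p["W_h_c"] @ h_prev
                      + p["b_c"])                             # formula (7)
    c_t = f_t * c_prev + in_t * c_tilde                       # formula (8), assumed
    h_t = o_t * np.tanh(c_t)                                  # formula (9)
    return h_t, c_t

def bilstm(X, p, hidden):
    """Steps 2-3/3-3: run the cell from X_1 to X_T for h_f and from
    X_T back to X_1 for h_b (one parameter set here for brevity)."""
    T = len(X)
    h, c, h_f = np.zeros(hidden), np.zeros(hidden), []
    for t in range(T):
        h, c = lstm_step(X[t], h, c, p)
        h_f.append(h)
    h, c, h_b = np.zeros(hidden), np.zeros(hidden), [None] * T
    for t in reversed(range(T)):
        h, c = lstm_step(X[t], h, c, p)
        h_b[t] = h
    return np.array(h_f), np.array(h_b)

# Toy usage with random parameters (W_x_* has shape (h, d), the rest (h, h)).
d, hsz = 4, 8
rng = np.random.default_rng(1)
p = {n: rng.normal(0.0, 0.1, (hsz, d) if n.startswith("W_x") else (hsz, hsz))
     for n in ["W_x_in", "W_h_in", "W_c_in", "W_x_o", "W_h_o", "W_c_o",
               "W_x_f", "W_h_f", "W_c_f", "W_x_c", "W_h_c"]}
p.update({b: np.zeros(hsz) for b in ["b_in", "b_o", "b_f", "b_c"]})
h_f, h_b = bilstm(rng.normal(0.0, 0.1, (5, d)), p, hsz)
```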
Further, the specific procedure in step 2-4 and step 3-4 for computing the sequence cost of the hidden-layer results $h_f$ and $h_b$ obtained in step 2-3 and step 3-3 with the transition-cost method over the entire sequence, obtaining the optimization objective, and optimizing it with gradient descent to update the LSTM parameters $\theta$ is:
First step: using the hidden layers $h_f$ and $h_b$, compute the cost $Q_t$ of labeling the sequence $x_k$ with each label:
$Q_t = h_f(t) \cdot W_f + h_b(t) \cdot W_b + b$ (10)
where $W_f$ is the parameter matrix multiplied with $h_f(t)$; $W_b$ the parameter matrix multiplied with $h_b(t)$; and $b$ the final output bias;
Second step: describe the cost of label transitions with the transition-cost matrix $A$, where the transition cost $A_{i,j}$ represents the cost of moving from label $i$ to label $j$; the whole cost of the input sequence $X$, i.e., the optimization objective, is then $\mathrm{cost}(X, y) = \sum_{t=1}^{T} Q_t[y_t] + \sum_{t=2}^{T} A_{y_{t-1}, y_t}$ (11);
Third step: using maximum likelihood estimation, compute the probability $p$ of the correct path to be maximized, normalizing the cost of the correct path against the costs of all possible paths (formula (12)),
where $\mathrm{cost}_{right}$ is the cost of the correct path;
Fourth step: with the gradient descent algorithm, obtain the neural network parameters $\theta$ according to the maximized probability $p$ of the correct path.
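The following is a minimal sketch of this sequence cost, formulas (10) to (12). Reading the probability of formula (12) as the standard linear-chain normalization of exponentiated path costs is an interpretation, since the formula image is not reproduced above:

```python
import numpy as np

def logsumexp(v, axis=None):
    m = np.max(v, axis=axis, keepdims=True)
    return np.squeeze(m + np.log(np.sum(np.exp(v - m), axis=axis,
                                        keepdims=True)), axis=axis)

def path_cost(Q, A, y):
    """Formula (11): per-position label costs Q_t[y_t] plus transition
    costs A[y_{t-1}, y_t] along the label path y."""
    s = Q[0, y[0]]
    for t in range(1, len(y)):
        s += A[y[t - 1], y[t]] + Q[t, y[t]]
    return s

def log_sum_over_paths(Q, A):
    """Normalizer of formula (12): obtained in linear time by dynamic
    programming instead of enumerating the exponentially many paths."""
    alpha = Q[0]
    for t in range(1, Q.shape[0]):
        alpha = Q[t] + logsumexp(alpha[:, None] + A, axis=0)
    return logsumexp(alpha)

def neg_log_p(h_f, h_b, W_f, W_b, b, A, y):
    """-log p of the correct path; gradient descent on this updates theta."""
    Q = h_f @ W_f + h_b @ W_b + b              # formula (10)
    return log_sum_over_paths(Q, A) - path_cost(Q, A, y)
```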
Further, the specific method by which the neural network parameters $\theta$ updated in steps 2 and 3 are used in step 2-4 and step 3-4 to carry out the path search over labelings and obtain the annotation results of the corpus is: arrange the costs of the input sequence $X$ into a matrix $C$, and compute the annotation results of the test portion of the annotated corpus from the matrix $C$ with the Viterbi algorithm.
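A minimal sketch of this path search with the Viterbi algorithm, with the cost matrix $C$ realized directly by the per-position costs $Q$ of formula (10) and the transition costs $A$:

```python
import numpy as np

def viterbi(Q, A):
    """Step 4 path search: the highest-cost label path under the per-position
    costs Q (T x L, from formula (10)) and transition costs A (L x L)."""
    T, L = Q.shape
    score = Q[0].copy()
    back = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + A + Q[t]   # cand[i, j]: best path ending at i, then j
        back[t] = np.argmax(cand, axis=0)
        score = np.max(cand, axis=0)
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]                      # annotation result as label indices

# Toy usage: 4 positions, 3 labels.
rng = np.random.default_rng(2)
labels = viterbi(rng.normal(size=(4, 3)), rng.normal(size=(3, 3)))
```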
Further, in step 5 the number of repetitions of steps 2, 3, and 4 reaches a maximum of 60 to 90.
Advantageous effects of the present invention:
The present invention, a deep-learning-based named entity recognition method for the medical field, relates to named entity recognition methods in the field of information extraction, and the related research promotes research on named entity recognition. The present invention aims to alleviate the lack of annotated corpora for entity recognition in the medical field, studying how the medical field can label better using a large annotated news-domain corpus and a small annotated medical-field corpus. By using deep learning methods, the present invention further mines the information contained in the corpora; at the same time, large-scale corpus information is introduced to keep the model from losing accuracy during testing when too many common open-domain words absent from training appear. Experimental results show that this deep-learning-based named entity recognition method for the medical field is better suited to named entity recognition in the medical field than traditional medical-field named entity recognition methods.
The present invention relates to named entity recognition methods, and more particularly to a deep-learning-based named entity recognition method for the medical field. The information extraction field to which the present invention belongs benefits from the research, which promotes research on named entity recognition.
The purpose of the present invention is to make full use of the existing annotated corpora for medical named entity recognition and to improve the performance of deep neural networks on the medical named entity recognition task. At the same time, to address the scarcity of annotated corpora for medical named entity recognition, a large unannotated corpus and a large news-domain corpus participate in model training, and a deep-learning-based named entity recognition method for the medical field is proposed.
The related research of the present invention improves the performance of medical named entity recognition; it not only provides evidence for related theories in informatics and linguistics but also promotes natural language understanding. To improve the performance of named entity recognition, the present invention makes full use of the small existing annotated corpus for medical named entity recognition, modeling with an LSTM deep neural network and adding the information of a large raw corpus through the pre-training technique of deep neural networks. Compared with traditional methods, this method requires no additional manual annotation of entity recognition corpora, reducing the consumption of manpower and material resources, while improving the performance of named entity recognition in the medical field.
The present invention places no requirement on the granularity of corpus preprocessing: annotation can be carried out either by word or by character, depending mainly on the training corpus used. Considering that many medical entity words seldom appear in the open domain, training at word granularity would require segmenting the pre-training corpus, which may bring some difficulties. To reduce the consumption of manpower and material resources to the greatest extent, processing by character is recommended.
Generally speaking, this method proposes a deep-learning-based named entity recognition method for the medical field.
A model was trained with a small medical corpus and used to label a large amount of text crawled from online medical question-answering websites; the high-frequency words in the annotation results of the two models were counted and compared, as in the following table:
Table 1: comparison of high-frequency words of the CRF model and the LSTM model on the online question-answering corpus
Bold entries in the table are annotation results clearly without medical meaning; it can be seen that the LSTM performs much better than the CRF model.
Description of the drawings
Fig. 1 is the flow chart of the deep-learning-based named entity recognition method for the medical field proposed in specific embodiment one; Fig. 2 is the computation flow chart of the LSTM proposed in specific embodiment one.
Specific embodiment
The present invention is further described below with reference to specific embodiments, but the present invention is not limited by these examples.
Specific embodiment one: with reference to Fig. 1, the deep-learning-based named entity recognition method for the medical field of this embodiment is specifically prepared according to the following steps:
Step 1: train word vectors $vec_i$ on the unannotated medical corpus, obtaining the vocabulary $voc$ of the supplementary medical-field corpus and the word vectors $vec$ corresponding to $voc$; where $vec = [vec_1, vec_2, \ldots, vec_n]$; $voc = [voc_1, voc_2, \ldots, voc_n]$; $i = 1, 2, \ldots, n$; and $n$ is the total number of distinct words in the unannotated corpus;
Step 2: train the long short-term memory network LSTM on the training portion of the annotated news-domain corpus. The word vectors $vec$ of step 1 serve as the pre-training vectors for the training of the LSTM; with the LSTM method, the optimization objective is computed from the pre-training vectors and $x_k$, $y_k$, and gradient descent on this objective updates the LSTM parameters $\theta_C$. The annotated corpus comprises a training portion and a test portion. The training finally yields the converged LSTM parameters, i.e., the values of the model parameters $\theta_C$ at the final, $n$-th iteration, specifically: $W_{x,in}$, $W_{h,in}$, $W_{c,in}$, $W_{x,o}$, $W_{h,o}$, $W_{c,o}$, $W_{x,f}$, $W_{h,f}$, $W_{c,f}$, $b_{in}$, $b_o$, and $b_f$, with the parameter meanings given in the summary above.
where $x_k$ is the word sequence of LSTM inputs corresponding to the training portion of the annotated corpus for the $k$-th sample, and $y_k$ is the annotation-result vector corresponding to the training portion of the annotated corpus for the $k$-th sample;
Step 3: train the long short-term memory network LSTM on the training portion of the annotated medical-domain corpus. The word vectors $vec$ obtained in step 1 serve as the pre-training vectors for the training of the LSTM; with the LSTM method, the optimization objective is computed from the pre-training vectors and $x_k$, $y_k$, and gradient descent on this objective updates the LSTM parameters $\theta$. The annotated corpus comprises a training portion and a test portion;
where $x_k$ is the word sequence of LSTM inputs corresponding to the training portion of the annotated corpus for the $k$-th sample, and $y_k$ is the annotation-result vector corresponding to the training portion of the annotated corpus for the $k$-th sample;
Step 4: test of the LSTM. Input the annotated corpus, carry out the path search over labelings with the neural network parameters $\theta_C$ updated in step 2, and obtain the annotation results of the annotated corpus; assess the annotation results on the test portion of the annotated corpus with the named entity recognition evaluation F value; the specific assessment is computed as follows:
precision = number of correctly labeled entity words / total number of labeled entity words
recall = number of correctly labeled entity words / total number of entity words
F value = 2 × precision × recall / (precision + recall) (14);
Step 5: repeat steps 2, 3, and 4 on the annotated corpora until the named entity recognition evaluation F value of step 4 no longer increases or the number of repetitions of steps 2 and 3 reaches a maximum of 50 to 100.
Effects of this embodiment:
This embodiment, a deep-learning-based named entity recognition method for the medical field, relates to named entity recognition methods in the field of information extraction, and the related research promotes research on named entity recognition. This embodiment aims to alleviate the lack of annotated corpora for entity recognition in the medical field, studying how the medical field can label better using a small annotated medical-domain corpus and a large annotated news-domain corpus. By using deep learning methods, this embodiment further mines the information contained in the corpora and learns large-scale linguistic features from the annotated news-domain corpus; at the same time, large-scale corpus information is introduced to keep the model from losing accuracy during testing when too many common open-domain words absent from training appear. Experimental results show that this deep-learning-based named entity recognition method for the medical field is better suited to named entity recognition in the medical field than traditional medical-field named entity recognition methods.
This embodiment relates to named entity recognition methods, and more particularly to a deep-learning-based named entity recognition method for the medical field. The information extraction field to which this embodiment belongs benefits from the research, which promotes research on named entity recognition.
The purpose of this embodiment is to make full use of the existing annotated corpora for medical named entity recognition and, with the help of a large annotated news-domain corpus, to improve the performance of deep neural networks on the medical named entity recognition task. At the same time, to address the scarcity of annotated corpora for medical named entity recognition, a large unannotated medical-field corpus participates in model training, and a deep-learning-based named entity recognition method for the medical field is proposed.
The related research of this embodiment improves the performance of medical named entity recognition; it not only provides evidence for related theories in informatics and linguistics but also promotes natural language understanding. To improve the performance of named entity recognition, this embodiment makes full use of the small existing annotated corpus for medical named entity recognition, modeling with an LSTM deep neural network, adding the information of a large unannotated corpus through the pre-training technique of deep neural networks, and incorporating the model parameters of the news domain into the LSTM deep neural network model of the medical field. Compared with traditional methods, this method requires no additional manual annotation of entity recognition corpora, reducing the consumption of manpower and material resources, while improving the performance of named entity recognition in the medical field.
This embodiment places no requirement on the granularity of corpus preprocessing: annotation can be carried out either by word or by character, depending mainly on the training corpus used. Considering that many medical entity words seldom appear in the open domain, training at word granularity would require segmenting the pre-training corpus, which may bring some difficulties. To reduce the consumption of manpower and material resources to the greatest extent, processing by character is recommended.
Generally speaking, this method proposes a deep-learning-based named entity recognition method for the medical field.
A model was trained with a small medical corpus and used to label a large amount of text crawled from online medical question-answering websites; the high-frequency words in the annotation results of the two models were counted and compared, as in the following table:
Table 2: comparison of high-frequency words of the CRF model and the LSTM model on the online question-answering corpus
Bold entries in the table are clearly meaningless annotation results; it can be seen that the LSTM performs much better than the CRF model.
Specific embodiment two: this embodiment differs from specific embodiment one in that:
Step 2-1: pre-train the vocabulary $voc$ and the word vectors $vec$ corresponding to $voc$; using $x_k$ and the word vectors $vec$ obtained in step 1, compute the input $X$ of the LSTM network, where the input $X$ of the LSTM network can be computed by two methods, namely: method one, which uses the word vectors $vec$ as the initial values of the LSTM model; and method two, which uses the word vectors $vec$ as the input of the LSTM network;
Step 2-2: using the input $X_t$, the hidden layer $h_{t-1}$ computed at step $t-1$, and the memory cell $c_{t-1}$ computed at step $t-1$, compute the input gate $in_t$ of the LSTM model at step $t$, the output gate $o_t$ of the LSTM model, and the forget gate $f_t$ of the LSTM model; from $in_t$, $o_t$, and $f_t$ compute the memory cell value $c_t$ and the hidden layer value $h_t$; where $X = X_1, X_2, \ldots, X_t, \ldots, X_T$;
Step 2-3: for the input sequence $X$, input the elements in order from $X_1$ to $X_T$ into formula (9) of step 2-2 to obtain the forward hidden-layer output $h_f$, and in order from $X_T$ to $X_1$ into formula (9) of step 2-2 to obtain the backward hidden-layer output $h_b$;
Step 2-4: compute the sequence cost of the hidden-layer results $h_f$ and $h_b$ obtained in step 2-3 with the transition-cost method over the entire sequence, obtaining the optimization objective; optimize it with gradient descent to update the LSTM parameters $\theta_C$, where $\theta_C$ comprises word_emb, $W_{x,in}$, $W_{h,in}$, $W_{c,in}$, $W_{x,o}$, $W_{h,o}$, $W_{c,o}$, $W_{x,f}$, $W_{h,f}$, $W_{c,f}$, $b_{in}$, $b_o$, and $b_f$. Other steps and parameters are the same as in specific embodiment one.
Specific embodiment three: this embodiment differs from specific embodiment one in that:
Step 3-1: pre-train the vocabulary $voc$ and the word vectors $vec$ corresponding to $voc$; using $x_k$ and the word vectors $vec$ obtained in step 1, compute the input $X$ of the LSTM network, where the input $X$ of the LSTM network can be computed by two methods, namely: method one, which uses the word vectors $vec$ as the initial values of the LSTM model; and method two, which uses the word vectors $vec$ as the input of the LSTM network;
Step 3-2: load the model parameters $\theta_n$ obtained by training the news-domain LSTM; on the basis of the parameters $\theta_n$, using the input $X_t$, the hidden layer $h_{t-1}$ computed at step $t-1$, and the memory cell $c_{t-1}$ computed at step $t-1$, compute the input gate $in_t$ of the LSTM model at step $t$, the output gate $o_t$ of the LSTM model, and the forget gate $f_t$ of the LSTM model; from $in_t$, $o_t$, and $f_t$ compute the memory cell value $c_t$ and the hidden layer value $h_t$; where $X = X_1, X_2, \ldots, X_t, \ldots, X_T$;
Step 3-3: for the input sequence $X$, input the elements in order from $X_1$ to $X_T$ into formula (9) of step 3-2 to obtain the forward hidden-layer output $h_f$, and in order from $X_T$ to $X_1$ into formula (9) of step 3-2 to obtain the backward hidden-layer output $h_b$;
Step 3-4: compute the sequence cost of the hidden-layer results $h_f$ and $h_b$ obtained in step 3-3 with the transition-cost method over the entire sequence, obtaining the optimization objective; optimize it with gradient descent to update the LSTM parameters $\theta$, where $\theta$ comprises word_emb, $W_{x,in}$, $W_{h,in}$, $W_{c,in}$, $W_{x,o}$, $W_{h,o}$, $W_{c,o}$, $W_{x,f}$, $W_{h,f}$, $W_{c,f}$, $b_{in}$, $b_o$, and $b_f$. Other steps and parameters are the same as in specific embodiment one.
Specific embodiment four: this embodiment differs from one of specific embodiments two to three in that the specific process of computing the input $X$ of the LSTM network with method one in step 2-1 and step 3-1 is:
Build the vocabulary $voc'$ of the training portion of the annotated corpus, and merge $voc'$ and $voc$ into the vocabulary $VOC$; $VOC = VOC_1, VOC_2, VOC_3, \ldots, VOC_N$;
Randomly initialize the vector matrix word_emb corresponding to the vocabulary $VOC$, such that the dimension of word_emb is identical to that of the word vectors $vec$, and carry out the assignment of formula (1):
where word_emb$_i$ is the $i$-th word vector in word_emb;
Finally, multiply $x_{k[k1,k2]}$ by word_emb to obtain the input $X$ of the LSTM network:
$X = x_{k[k1,k2]} \cdot \mathrm{word\_emb}$ (2)
where $x_{k[k1,k2]}$ is the subsequence of the word sequence $x_k$ between positions $k1$ and $k2$. Other steps and parameters are the same as in one of specific embodiments two to three.
Specific embodiment five: this embodiment differs from one of specific embodiments two to three in that:
the specific process of computing the input $X$ of the LSTM network with method two in step 2-1 and step 3-1 is:
Randomly initialize the vector matrix word_emb corresponding to the vocabulary $VOC$ and, after carrying out the assignment of formula (1), keep the vectors word_emb$_i$ unchanged, i.e., they are not updated as parameters; then randomly initialize a further vector matrix word_emb_para corresponding to the vocabulary $VOC$, and compute the input $X$ of the LSTM network according to formula (3).
With the word_emb parameters fixed, word_emb_para is updated entirely according to the standard parameter update. Other steps and parameters are the same as in one of specific embodiments two to three.
Specific embodiment six: this embodiment differs from one of specific embodiments two to three in that the input gate $in_t$ of the LSTM model (or memory cell) computed at step $t$ in step 2-2 and step 3-2 is specifically:
$in_t = \sigma(W_{x,in} X_t + W_{h,in} h_{t-1} + W_{c,in} c_{t-1} + b_{in})$ (4)
where $\sigma$ is the sigmoid function; $W_{x,in}$ is the input-gate parameter matrix multiplied with $X_t$; $W_{h,in}$ the input-gate parameter matrix multiplied with $h_{t-1}$; $W_{c,in}$ the input-gate parameter matrix multiplied with $c_{t-1}$; and $b_{in}$ the bias for computing the input gate. Other steps and parameters are the same as in one of specific embodiments two to three.
Specific embodiment seven: this embodiment differs from one of specific embodiments two to three in that the specific process of computing the output gate $o_t$ of the LSTM model (or memory cell) at step $t$ in step 2-2 and step 3-2 is:
$o_t = \sigma(W_{x,o} X_t + W_{h,o} h_{t-1} + W_{c,o} c_{t-1} + b_o)$ (5)
where $W_{x,o}$ is the output-gate parameter matrix multiplied with $X_t$; $W_{h,o}$ the output-gate parameter matrix multiplied with $h_{t-1}$; $W_{c,o}$ the output-gate parameter matrix multiplied with $c_{t-1}$; and $b_o$ the bias for computing the output gate. Other steps and parameters are the same as in one of specific embodiments two to three.
Specific embodiment eight: this embodiment differs from one of specific embodiments two to three in that the specific process of computing the forget gate $f_t$ of the LSTM model (or memory cell) at step $t$ in step 2-2 and step 3-2 is:
$f_t = \sigma(W_{x,f} X_t + W_{h,f} h_{t-1} + W_{c,f} c_{t-1} + b_f)$ (6)
where $W_{x,f}$ is the forget-gate parameter matrix multiplied with $X_t$; $W_{h,f}$ the forget-gate parameter matrix multiplied with $h_{t-1}$; $W_{c,f}$ the forget-gate parameter matrix multiplied with $c_{t-1}$; and $b_f$ the bias for computing the forget gate. Other steps and parameters are the same as in one of specific embodiments two to three.
Specific embodiment nine: this embodiment differs from one of specific embodiments two to three in that computing the memory cell value $c_t$ and the hidden layer value $h_t$ from $in_t$, $o_t$, and $f_t$ in step 2-2 and step 3-2 is specifically:
(1) first compute the ungated memory cell value at step $t$, $\tilde{c}_t = \tanh(W_{x,c} X_t + W_{h,c} h_{t-1} + b_c)$ (7);
where $W_{x,c}$ is the memory-cell parameter matrix multiplied with $X_t$; $W_{h,c}$ the memory-cell parameter matrix multiplied with $h_{t-1}$; and $b_c$ the memory-cell bias;
(2) from the input gate value $in_t$ computed according to (4) and the forget gate value $f_t$ computed according to (6), the ungated memory cell value $\tilde{c}_t$, and $c_{t-1}$, compute the memory cell value at step $t$, $c_t = f_t \circ c_{t-1} + in_t \circ \tilde{c}_t$ (8);
Finally, from the memory cell value $c_t$ and the output gate $o_t$ computed by formula (5), compute the hidden layer value $h_t$:
$h_t = o_t \circ \tanh(c_t)$ (9).
Other steps and parameters are the same as in one of specific embodiments one to six.
Specific embodiment ten: this embodiment differs from one of specific embodiments two to three in that the detailed process in step 2-4 and step 3-4 of computing the sequence cost of the hidden-layer results $h_f$ and $h_b$ obtained in step 2-3 and step 3-3 with the transition-cost method over the entire sequence, obtaining the optimization objective, and optimizing it with gradient descent to update the LSTM parameters $\theta$ is:
(1) first, using the hidden layers $h_f$ and $h_b$, compute the cost $Q_t$ of labeling the sequence $x_k$ with each label:
$Q_t = h_f(t) \cdot W_f + h_b(t) \cdot W_b + b$ (10)
where $W_f$ is the parameter matrix multiplied with $h_f(t)$; $W_b$ the parameter matrix multiplied with $h_b(t)$; and $b$ the final output bias;
(2) the transition-cost matrix $A$ describes the cost of label transitions, where the transition cost $A_{i,j}$ represents the cost of moving from label $i$ to label $j$; the whole cost of the input sequence $X$, i.e., the optimization objective, is then $\mathrm{cost}(X, y) = \sum_{t=1}^{T} Q_t[y_t] + \sum_{t=2}^{T} A_{y_{t-1}, y_t}$ (11);
(3) using maximum likelihood estimation, compute the probability $p$ of the correct path to be maximized (formula (12)),
where $\mathrm{cost}_{right}$ is the cost of the correct path;
although the number of paths grows exponentially, the sum over all paths in formula (12) need not be computed by traversing every path; it can be obtained in linear time with a dynamic programming algorithm;
(4) with the gradient descent algorithm, update the neural network parameters $\theta$ according to the probability $p$ of the correct path, where the updated $\theta$ comprises, as neural network parameters, all the variables mentioned in steps 2-1 and 2-2; the sequence cost must be computed to obtain the optimization objective of the system. Other steps and parameters are the same as in one of specific embodiments two to three.
Specific embodiment eleven: this embodiment differs from one of specific embodiments two or three in that the specific method by which the neural network parameters updated in steps 2 and 3 are used in step 2-4 and step 3-4 to carry out the path search over labelings and obtain the annotation results of the corpus is:
arrange the costs of the input sequence $X$ into a matrix $C$, and compute the annotation results of the test portion of the annotated corpus from the matrix $C$ with the Viterbi algorithm. Other steps and parameters are the same as in one of specific embodiments two or three.
Specific embodiment twelve: this embodiment differs from one of specific embodiments two or three in that in step 5 the number of repetitions of steps 2, 3, and 4 reaches a maximum of 60 to 90. Other steps and parameters are the same as in one of specific embodiments two or three.
Although the present invention has been disclosed above with preferred embodiments, they are not intended to limit the present invention. Anyone familiar with this technology may make various changes and modifications without departing from the spirit and scope of the present invention; the protection scope of the present invention shall therefore be subject to what the claims define.

Claims (10)

1. A deep-learning-based named entity recognition method for the medical field, characterized in that the specific steps of the method are as follows:
Step 1: train word vectors $vec_i$ on the unannotated medical corpus, obtaining the vocabulary $voc$ of the supplementary medical-field corpus and the word vectors $vec$ corresponding to $voc$; $vec = [vec_1, vec_2, \ldots, vec_n]$; $voc = [voc_1, voc_2, \ldots, voc_n]$; where $i = 1, 2, \ldots, n$ and $n$ is the total number of distinct words in the unannotated corpus;
Step 2: train the long short-term memory network LSTM on the training portion of the annotated news-domain corpus. The word vectors $vec$ of step 1 serve as the pre-training vectors for the training of the LSTM; with the LSTM method, the optimization objective is computed from the pre-training vectors and $x_k$, $y_k$, and gradient descent on this objective updates the LSTM parameters $\theta_C$. The annotated corpus comprises a training portion and a test portion. The training finally yields the converged LSTM parameters, i.e., the values of the model parameters $\theta_C$ at the final, $n$-th iteration, specifically: $W_{x,in}$, $W_{h,in}$, $W_{c,in}$, $W_{x,o}$, $W_{h,o}$, $W_{c,o}$, $W_{x,f}$, $W_{h,f}$, $W_{c,f}$, $b_{in}$, $b_o$, and $b_f$, where $W_{x,in}$ is the input weight of the hidden-layer input gate; $W_{h,in}$ the state-input weight of the hidden-layer input gate; $W_{c,in}$ the memory-cell input weight of the hidden layer; $W_{x,o}$ the input weight of the hidden-layer output gate; $W_{h,o}$ the state-input weight of the hidden-layer output gate; $W_{c,o}$ the memory-cell output weight of the hidden layer; $W_{x,f}$ the input weight of the hidden-layer forget gate; $W_{h,f}$ the state-input weight of the hidden-layer forget gate; $W_{c,f}$ the memory-cell input weight of the hidden-layer forget gate; $b_{in}$ the hidden-layer input-gate bias; $b_o$ the hidden-layer output-gate bias; and $b_f$ the hidden-layer forget-gate bias;
where $x_k$ is the word sequence of LSTM inputs corresponding to the training portion of the annotated corpus for the $k$-th sample, and $y_k$ is the annotation-result vector corresponding to the training portion of the annotated corpus for the $k$-th sample;
Step 3: train the long short-term memory network LSTM on the training portion of the annotated medical-domain corpus. The word vectors $vec$ obtained in step 1 serve as the pre-training vectors for the training of the LSTM; with the LSTM method, the optimization objective is computed from the pre-training vectors and $x_k$, $y_k$, and gradient descent on this objective updates the LSTM parameters $\theta$. The annotated corpus comprises a training portion and a test portion;
where $x_k$ is the word sequence of LSTM inputs corresponding to the training portion of the annotated corpus for the $k$-th sample, and $y_k$ is the annotation-result vector corresponding to the training portion of the annotated corpus for the $k$-th sample;
Step 4: test the parameter-updated LSTM. The test process is: input the annotated corpora of steps 2 and 3, carry out the path search over labelings with the neural network parameters $\theta_C$ updated in step 2, and obtain the annotation results of the annotated corpus; assess the annotation results on the test portion of the annotated corpus with the named entity recognition evaluation F value, obtaining annotation results that meet expectations after assessment. The specific assessment is computed as follows:
precision = number of correctly labeled entity words / total number of labeled entity words
recall = number of correctly labeled entity words / total number of entity words
F value = 2 × precision × recall / (precision + recall)
Step 5: repeat steps 2 to 4 on the annotated corpora until the named entity recognition evaluation F value of step 4 no longer increases or the number of repetitions of steps 2 to 4 reaches a maximum of 50 to 100.
2. The named entity recognition method according to claim 1, characterized in that the update of the LSTM parameters $\theta_C$ in step 2 proceeds as follows:
Step 2-1: pre-train the vocabulary $voc$ and the word vectors $vec$ corresponding to $voc$; using $x_k$ and the word vectors $vec$ obtained in step 1, compute the input sequence $X$ of the LSTM network, where $X = X_1, X_2, \ldots, X_t, \ldots, X_T$;
Step 2-2: using the input $X_t$, the hidden layer $h_{t-1}$ computed at step $t-1$, and the memory cell $c_{t-1}$ computed at step $t-1$, compute the input gate $in_t$ of the LSTM model at step $t$, the output gate $o_t$ of the LSTM model, and the forget gate $f_t$ of the LSTM model; from $in_t$, $o_t$, and $f_t$ compute the memory cell value $c_t$ and the hidden layer value $h_t$, where the specific model of the hidden layer value $h_t$ is $h_t = o_t \circ \tanh(c_t)$;
Step 2-3: input the elements of the sequence $X = X_1, X_2, \ldots, X_t, \ldots, X_T$ of step 2-1 in order from $X_1$ to $X_T$ into the model of the hidden layer value $h_t$ of step 2-2 to obtain the forward hidden-layer output $h_f$; then input the elements of the same sequence in order from $X_T$ to $X_1$ to obtain the backward hidden-layer output $h_b$;
Step 2-4: compute the sequence cost of the hidden-layer results $h_f$ and $h_b$ obtained in step 2-3 with the transition-cost method over the entire sequence, obtaining the optimization objective; optimize it with gradient descent to update the LSTM parameters $\theta_C$, where $\theta_C$ comprises word_emb, $W_{x,in}$, $W_{h,in}$, $W_{c,in}$, $W_{x,o}$, $W_{h,o}$, $W_{c,o}$, $W_{x,f}$, $W_{h,f}$, $W_{c,f}$, $b_{in}$, $b_o$, and $b_f$, and word_emb is the pre-trained word-vector weight parameter.
3. The named entity recognition method according to claim 1, characterized in that the update of the LSTM parameters $\theta$ in step 3 proceeds as follows:
Step 3-1: pre-train the vocabulary $voc$ and the word vectors $vec$ corresponding to $voc$; using $x_k$ and the word vectors $vec$ obtained in step 1, compute the input sequence $X$ of the LSTM network, where $X = X_1, X_2, \ldots, X_t, \ldots, X_T$;
Step 3-2: load the model parameters $\theta_n$ obtained by training the news-domain LSTM; on the basis of the parameters $\theta_n$, using the input $X_t$, the hidden layer $h_{t-1}$ computed at step $t-1$, and the memory cell $c_{t-1}$ computed at step $t-1$, compute the input gate $in_t$ of the LSTM model at step $t$, the output gate $o_t$ of the LSTM model, and the forget gate $f_t$ of the LSTM model; from $in_t$, $o_t$, and $f_t$ compute the memory cell value $c_t$ and the hidden layer value $h_t$, where the specific model of the hidden layer value $h_t$ is $h_t = o_t \circ \tanh(c_t)$;
Step 3-3: input the elements of the sequence $X = X_1, X_2, \ldots, X_t, \ldots, X_T$ of step 3-1 in order from $X_1$ to $X_T$ into the model of the hidden layer value $h_t$ of step 3-2 to obtain the forward hidden-layer output $h_f$; then input the elements of the same sequence in order from $X_T$ to $X_1$ to obtain the backward hidden-layer output $h_b$;
Step 3-4: compute the sequence cost of the hidden-layer results $h_f$ and $h_b$ obtained in step 3-3 with the transition-cost method over the entire sequence, obtaining the optimization objective; optimize it with gradient descent to update the LSTM parameters $\theta$, where $\theta$ comprises word_emb, $W_{x,in}$, $W_{h,in}$, $W_{c,in}$, $W_{x,o}$, $W_{h,o}$, $W_{c,o}$, $W_{x,f}$, $W_{h,f}$, $W_{c,f}$, $b_{in}$, $b_o$, and $b_f$.
4. The named entity recognition method according to claim 2 or 3, characterized in that the specific process of obtaining the input $X$ of the LSTM network in step 2-1 and step 3-1 is:
Build the vocabulary $voc'$ of the training portion of the annotated corpus, and merge $voc'$ and $voc$ into the vocabulary $VOC$;
$VOC = VOC_1, VOC_2, VOC_3, \ldots, VOC_N$;
Randomly initialize the vector matrix word_emb corresponding to the vocabulary $VOC$, such that the dimension of word_emb is identical to that of the word vectors $vec$, and carry out the assignment of formula (1):
where word_emb$_i$ is the $i$-th word vector in word_emb;
Finally, multiply $x_{k[k1,k2]}$ by word_emb to obtain the input $X$ of the LSTM network:
$X = x_{k[k1,k2]} \cdot \mathrm{word\_emb}$ (2)
where $x_{k[k1,k2]}$ is the subsequence of the word sequence $x_k$ between positions $k1$ and $k2$.
5. The named entity recognition method according to claim 2 or 3, characterized in that the specific process of obtaining the input $X$ of the LSTM network in step 2-1 and step 3-1 is:
Randomly initialize the vector matrix word_emb corresponding to the vocabulary $VOC$ and, after carrying out the assignment of formula (1), keep the vectors word_emb$_i$ unchanged, i.e., they are not updated as parameters;
then randomly initialize a further vector matrix word_emb_para corresponding to the vocabulary $VOC$, and compute the input $X$ of the LSTM network according to the model of formula (3).
6. The named entity recognition method according to claim 2 or 3, characterized in that the input gate $in_t$ of the LSTM model computed at step $t$ in step 2-2 and step 3-2 is obtained according to model (4), which is as follows:
$in_t = \sigma(W_{x,in} X_t + W_{h,in} h_{t-1} + W_{c,in} c_{t-1} + b_{in})$ (4)
where $\sigma$ is the sigmoid function; $W_{x,in}$ is the input-gate parameter matrix multiplied with $X_t$; $W_{h,in}$ the input-gate parameter matrix multiplied with $h_{t-1}$; $W_{c,in}$ the input-gate parameter matrix multiplied with $c_{t-1}$; and $b_{in}$ the bias for computing the input gate.
7. entity recognition method is named according to Claims 2 or 3, which is characterized in that rapid 22 with step 3 two described in The out gate o of the t times calculating LSTM modeltIt is to be obtained according to model (5), model (5) is as follows:
o_t = σ(W_X_o X_t + W_h_o h_{t-1} + W_c_o c_{t-1} + b_o) (5)
where W_X_o is the output gate parameter matrix multiplied with X_t; W_h_o is the output gate parameter matrix multiplied with h_{t-1}; W_c_o is the output gate parameter matrix multiplied with c_{t-1}; and b_o is the bias for computing the output gate.
8. The named entity recognition method according to claim 2 or 3, wherein the forget gate f_t of the LSTM model at the t-th computation described in step 2-2 and step 3-2 is obtained according to model (6), which is as follows:
f_t = σ(W_X_f X_t + W_h_f h_{t-1} + W_c_f c_{t-1} + b_f) (6)
where W_X_f is the forget gate parameter matrix multiplied with X_t; W_h_f is the forget gate parameter matrix multiplied with h_{t-1}; W_c_f is the forget gate parameter matrix multiplied with c_{t-1}; and b_f is the bias for computing the forget gate.
9. The named entity recognition method according to claim 2 or 3, wherein the specific process in step 2-2 and step 3-2 of computing the memory cell value c_t and the hidden layer value h_t from in_t, o_t, and f_t is:
step 1: first compute the candidate memory cell value of the t-th computation, i.e., the value before any gate is applied:
c̃_t = tanh(W_X_c X_t + W_h_c h_{t-1} + b_c) (7)
where W_X_c is the memory cell parameter matrix multiplied with X_t; W_h_c is the memory cell parameter matrix multiplied with h_{t-1}; and b_c is the memory cell bias;
step 2: from the input gate value in_t computed by model (4), the forget gate value f_t computed by model (6), the ungated candidate value c̃_t, and c_{t-1}, compute the memory cell value c_t of the t-th computation:
c_t = f_t ⊙ c_{t-1} + in_t ⊙ c̃_t (8)
Finally, compute the hidden layer value h_t from the memory cell value c_t and the output gate o_t obtained by formula (5); the concrete model of h_t is as follows:
h_t = o_t ⊙ tanh(c_t) (9).
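Putting models (4) through (9) together, one step of the memory cell computation can be sketched in Python as follows; this is a sketch under the shapes implied by the claims, not a definitive implementation, with σ realized as the logistic sigmoid and * as element-wise multiplication:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(X_t, h_prev, c_prev, p):
    """One application of models (4)-(9).

    p is a dict holding the parameter matrices and biases named in the
    claims: W_X_in, W_h_in, W_c_in, b_in; W_X_o, W_h_o, W_c_o, b_o;
    W_X_f, W_h_f, W_c_f, b_f; W_X_c, W_h_c, b_c.
    """
    # Model (4): input gate.
    in_t = sigmoid(p["W_X_in"] @ X_t + p["W_h_in"] @ h_prev
                   + p["W_c_in"] @ c_prev + p["b_in"])
    # Model (6): forget gate.
    f_t = sigmoid(p["W_X_f"] @ X_t + p["W_h_f"] @ h_prev
                  + p["W_c_f"] @ c_prev + p["b_f"])
    # Model (7): candidate memory cell value, before gating.
    c_tilde = np.tanh(p["W_X_c"] @ X_t + p["W_h_c"] @ h_prev + p["b_c"])
    # Model (8): gated memory cell update.
    c_t = f_t * c_prev + in_t * c_tilde
    # Model (5): output gate (uses c_{t-1}, as the claim specifies).
    o_t = sigmoid(p["W_X_o"] @ X_t + p["W_h_o"] @ h_prev
                  + p["W_c_o"] @ c_prev + p["b_o"])
    # Model (9): hidden layer value.
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```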
10. The named entity recognition method according to claim 2 or 3, wherein the specific process in step 2-4 and step 3-4 of computing the sequence cost from the hidden layer results h_f and h_b obtained in step 2-3 and step 3-3 using the whole-sequence cost computation with transfer costs, obtaining the optimization objective, and optimizing it with the gradient descent algorithm to update the LSTM parameters is:
First step: use the hidden layers h_f and h_b to compute the cost Q_t of labeling the sequence x_k with each label:
Q_t = h_f(t) · W_f + h_b(t) · W_b + b (10)
where W_f is the parameter matrix multiplied with h_f(t); W_b is the parameter matrix multiplied with h_b(t); and b is the final output bias;
Second step: describe the cost of label transitions with the transfer cost matrix A, where the transfer cost A_{i,j} denotes the transfer cost from label i to label j; the whole cost of the input sequence X along a label path y_1, …, y_T, which is the optimization objective, is then the sum of the per-position costs and the transfer costs along that path:
cost(X, y) = Σ_{t=1..T} (A_{y_{t-1}, y_t} + Q_t(y_t));
Third step: using maximum likelihood estimation, compute the probability p of the correct path, which is to be maximized:
p = exp(cost_right) / Σ_{y′} exp(cost(X, y′)), with the sum running over all possible label paths y′,
where cost_right is the cost of the correct path;
Fourth step: update the neural network parameters θ by maximizing the probability p of the correct path with the gradient descent algorithm.
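The path-cost and path-probability formulas above are reconstructions, since the original expressions are not reproduced in this text; the Python sketch below assumes that standard formulation, scoring a path as the sum of per-position costs Q_t and transfer costs A[i, j], and normalizing over all label paths with a forward (log-sum-exp) recursion:

```python
import numpy as np

def path_cost(Q, A, labels):
    """Whole cost of one label path: per-position costs Q[t, label]
    plus the transfer costs A[i, j] between consecutive labels."""
    cost = Q[0, labels[0]]
    for t in range(1, len(labels)):
        cost += A[labels[t - 1], labels[t]] + Q[t, labels[t]]
    return cost

def correct_path_log_prob(Q, A, gold):
    """log p of the correct path `gold` under a softmax over all label
    paths; Q has shape (T, L), A has shape (L, L)."""
    T, L = Q.shape
    alpha = Q[0]  # log-scores of length-1 prefixes ending in each label
    for t in range(1, T):
        # alpha_new[j] = logsumexp_i(alpha[i] + A[i, j]) + Q[t, j]
        alpha = np.logaddexp.reduce(alpha[:, None] + A, axis=0) + Q[t]
    log_Z = np.logaddexp.reduce(alpha)  # log of the sum over all paths
    return path_cost(Q, A, gold) - log_Z
```

Gradient descent on the negative of this log-probability (for example via automatic differentiation) then updates θ as in the fourth step.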
CN201711446980.8A 2017-12-27 2017-12-27 A kind of name entity recognition method based on deep learning towards medical field Pending CN108170675A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711446980.8A CN108170675A (en) 2017-12-27 2017-12-27 A kind of name entity recognition method based on deep learning towards medical field

Publications (1)

Publication Number Publication Date
CN108170675A true CN108170675A (en) 2018-06-15

Family

ID=62518135

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711446980.8A Pending CN108170675A (en) 2017-12-27 2017-12-27 A kind of name entity recognition method based on deep learning towards medical field

Country Status (1)

Country Link
CN (1) CN108170675A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170024645A1 (en) * 2015-06-01 2017-01-26 Salesforce.Com, Inc. Dynamic Memory Network
CN106202054A (en) * 2016-07-25 2016-12-07 哈尔滨工业大学 A kind of name entity recognition method learnt based on the degree of depth towards medical field
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN106557462A (en) * 2016-11-02 2017-04-05 数库(上海)科技有限公司 Name entity recognition method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Jianfeng: "Research on Chinese Named Entity Recognition Incorporating External Knowledge and Its Application in the Medical Field", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109002436A (en) * 2018-07-12 2018-12-14 上海金仕达卫宁软件科技有限公司 Medical text terms automatic identifying method and system based on shot and long term memory network
CN109325225A (en) * 2018-08-28 2019-02-12 昆明理工大学 It is a kind of general based on associated part-of-speech tagging method
CN109325225B (en) * 2018-08-28 2022-04-12 昆明理工大学 Universal relevance-based part-of-speech tagging method
CN109284400A (en) * 2018-11-28 2019-01-29 电子科技大学 A kind of name entity recognition method based on Lattice LSTM and language model
CN109284400B (en) * 2018-11-28 2020-10-23 电子科技大学 Named entity identification method based on Lattice LSTM and language model
CN110598206A (en) * 2019-08-13 2019-12-20 平安国际智慧城市科技股份有限公司 Text semantic recognition method and device, computer equipment and storage medium
CN111444720A (en) * 2020-03-30 2020-07-24 华南理工大学 Named entity recognition method for English text
US20220067486A1 (en) * 2020-09-02 2022-03-03 Sap Se Collaborative learning of question generation and question answering

Similar Documents

Publication Publication Date Title
CN106202054B (en) A kind of name entity recognition method towards medical field based on deep learning
CN108170675A (en) A kind of name entity recognition method based on deep learning towards medical field
CN109948165B (en) Fine granularity emotion polarity prediction method based on mixed attention network
CN107239446B (en) A kind of intelligence relationship extracting method based on neural network Yu attention mechanism
CN108874782B (en) A kind of more wheel dialogue management methods of level attention LSTM and knowledge mapping
CN107168945B (en) Bidirectional cyclic neural network fine-grained opinion mining method integrating multiple features
CN106156003B (en) A kind of question sentence understanding method in question answering system
CN106886543B (en) Knowledge graph representation learning method and system combined with entity description
CN105894088B (en) Based on deep learning and distributed semantic feature medical information extraction system and method
CN108804654A (en) A kind of collaborative virtual learning environment construction method based on intelligent answer
CN111414461B (en) Intelligent question-answering method and system fusing knowledge base and user modeling
CN107562792A (en) A kind of question and answer matching process based on deep learning
CN108229582A (en) Entity recognition dual training method is named in a kind of multitask towards medical domain
CN109285562A (en) Speech-emotion recognition method based on attention mechanism
CN109977234A (en) A kind of knowledge mapping complementing method based on subject key words filtering
CN110232122A (en) A kind of Chinese Question Classification method based on text error correction and neural network
CN108197294A (en) A kind of text automatic generation method based on deep learning
CN108804677A (en) In conjunction with the deep learning question classification method and system of multi-layer attention mechanism
CN107644062A (en) The knowledge content Weight Analysis System and method of a kind of knowledge based collection of illustrative plates
CN111428481A (en) Entity relation extraction method based on deep learning
CN112364623A (en) Bi-LSTM-CRF-based three-in-one word notation Chinese lexical analysis method
CN114398976A (en) Machine reading understanding method based on BERT and gate control type attention enhancement network
He et al. Analysis of the communication method of national traditional sports culture based on deep learning
Etchells et al. Learning what is important: feature selection and rule extraction in a virtual course.
CN111400445B (en) Case complex distribution method based on similar text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned

Effective date of abandoning: 20210924