CN109871538A

CN109871538A - Named entity recognition method for Chinese electronic health records

Info

Publication number: CN109871538A
Application number: CN201910119391.1A
Authority: CN
Inventors: 董守斌; 蔡晓玲; 胡金龙; 袁华; 董守玲
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2019-02-18
Filing date: 2019-02-18
Publication date: 2019-06-11

Abstract

The invention discloses a named entity recognition method for Chinese electronic health records, comprising the steps of: 1) constructing a common-word dictionary; 2) simplified part-of-speech tagging; 3) constructing text and part-of-speech vector mapping tables; 4) training a named entity prediction model; 5) predicting named entity labels. By adding part-of-speech features, the invention makes the boundary between named entities and common words easier to distinguish, which improves the accuracy of named entity boundaries. In addition, a self-attention mechanism is introduced into the bidirectional LSTM-CRF model to compute the relevance between the input at each time step and the other components of the sentence, which alleviates the long-range dependency problem and improves named entity recognition accuracy.

Description

Named entity recognition method for Chinese electronic health records

Technical field

The present invention relates to the technical field of named entity recognition for Chinese electronic health records, and in particular to a named entity recognition method for Chinese electronic health records.

Background art

Named entity recognition (NER) in electronic health records finds patient-related clinical entities in the descriptive text of the record, such as the body part affected by a disease, symptoms, drugs used, and operations. NER for Chinese electronic health records is the key to extracting information from them, and lays the foundation for Chinese health information processing tasks such as medical record retrieval, disease prediction, and the construction of medical knowledge graphs. However, electronic health records contain many out-of-vocabulary words, whose number keeps growing; moreover, compared with English, recognizing Chinese named entities is more complex.

The main difficulties of NER for Chinese electronic health records are: 1) Chinese text has no boundary markers such as the spaces in English text, so the first step of entity recognition is to determine the boundaries of named entities; 2) the Chinese word segmentation task and named entity recognition affect each other; 3) different categories of named entities have different characteristics and are difficult to handle jointly; 4) unlike medical literature, electronic health records follow no unified writing standard, and the personal expressions and varied abbreviations they contain make entity recognition harder.

In the medical field, the earliest NER methods for electronic health records generally combined dictionaries with rules. These methods mostly use rule templates built jointly by linguists and medical experts, select features such as punctuation marks, locality words, position words, and head words, and rely mainly on pattern matching and string matching. Dictionary- and rule-based methods depend heavily on the construction of the rule base and the dictionary, and when the data set changes, the rules and the dictionary may need to be rebuilt to fit the new data. Machine-learning-based methods compute relevant features from the sample data set, such as word features, label information, and part-of-speech information, to build a recognition model. They fall roughly into two classes. The first treats NER as a classification task and uses classification methods such as Bayesian models, support vector machines, and maximum entropy. The second treats NER as a sequence labeling task and uses models such as the hidden Markov model (HMM) and the conditional random field (CRF).

With the development of deep learning, neural networks have also been applied to NER. Deep learning methods can extract text features automatically, without feature engineering. Most current deep learning NER models combine a recurrent neural network (RNN) with a CRF; using character vectors or word vectors, they achieve good results and have become the mainstream among deep-learning-based NER methods. The commonly used long short-term memory network (LSTM) automatically extracts contextual features from the text sequence, while the conditional random field (CRF) considers not only the input features but also label transition features, so that the trained model outputs the optimal label sequence.

Most deep-learning-based methods use only word vectors or only character vectors as input. Because of the Chinese word segmentation problem, using word vectors as input may introduce segmentation errors that lead to entity recognition errors. Using character vectors as input expresses semantic information less well and also increases the length of named entities, which makes entity boundaries harder to extract. In model construction, most methods process the input vectors with an LSTM, which handles long-range dependencies by selectively retaining history information; but as sentences grow longer and time steps advance, the LSTM becomes less capable, has difficulty learning features from distant positions, and cannot handle named entity boundary extraction well.

Summary of the invention

The purpose of the present invention is to overcome the shortcomings and deficiencies of the prior art by proposing a named entity recognition method for Chinese electronic health records that fuses part of speech with a self-attention mechanism. Part-of-speech features are added to make the boundary between named entities and common words easier to distinguish, which improves the accuracy of named entity boundaries. In addition, a self-attention mechanism is introduced into the bidirectional LSTM-CRF model to compute the relevance between the input at each time step and the other components of the sentence, which alleviates the long-range dependency problem and improves named entity recognition accuracy.

In named entity recognition, machine-learning-based methods often use part-of-speech information as an important feature, but deep learning methods rarely do. One reason is that part-of-speech tagging of electronic health records is not yet mature: the tagging results contain many errors, and the propagation of these errors degrades entity recognition. To address this, the invention proposes a simplified part-of-speech tagging method: the part-of-speech tags of suspected named entity words are removed to avoid mislabeling entities, while the part-of-speech tags of common words are kept to introduce the contextual part-of-speech information and word boundary information of named entities.

To achieve the above purpose, the technical solution provided by the present invention is a named entity recognition method for Chinese electronic health records, comprising the following steps:

1) constructing a common-word dictionary: segmenting the labeled data into words and constructing a common-word dictionary;

2) simplified part-of-speech tagging: according to the common-word dictionary constructed in step 1), retaining the part-of-speech tags of common words and removing the part-of-speech tags of suspected named entity words;

3) constructing the text and part-of-speech vector mapping tables: using the word vector training tool word2vec to train on the text and on the parts of speech of the labeled data separately, obtaining a text vector mapping table and a part-of-speech vector mapping table;

4) training the named entity prediction model: using the mapping tables obtained in step 3), mapping the text and parts of speech of the labeled data to vectors, concatenating them, and inputting them into the model that fuses part of speech with self-attention; training yields the named entity prediction model;

5) predicting named entity labels: according to the common-word dictionary constructed in step 1), performing simplified part-of-speech tagging on the Chinese electronic health record data from which entities are to be extracted; using the mapping tables obtained in step 3), mapping the text and parts of speech of the data to vectors; and predicting named entity labels with the prediction model obtained in step 4).

In step 1), constructing the common-word dictionary comprises the following steps:

1.1) segmenting the labeled data with a Chinese word segmentation tool;

1.2) judging whether each segmentation unit falls within the span of a named entity; if so, the unit is part of a named entity word or contains part of an entity word, and is not processed; if not, the unit is a common word and is added to the dictionary, yielding the common-word dictionary.

In step 2), simplified part-of-speech tagging comprises the following steps:

2.1) performing part-of-speech tagging on the labeled data with a Chinese part-of-speech tagging tool;

2.2) judging whether each tagged unit appears in the common-word dictionary; if so, the unit is a common word and its part of speech is retained; if not, the unit may contain part of a named entity word, so it is split into a character sequence to avoid segmentation errors, and each character is tagged with the part of speech "s" to reduce part-of-speech tagging errors on named entities;

2.3) obtaining the part-of-speech tagging result of the labeled data.

In step 3), the text and part-of-speech vector mapping tables are constructed as follows: the part-of-speech tagging result of step 2) is separated into two files. One is the text sequence, containing word units and character units; the other is the corresponding part-of-speech sequence, containing the parts of speech of common words and the entity part of speech "s" of the units split into characters. The two files are trained separately with the word vector tool word2vec, yielding the text vector mapping table and the part-of-speech vector mapping table.

In step 4), training the named entity prediction model comprises the following steps:

4.1) according to the mapping tables obtained in step 3), mapping the text and parts of speech of the labeled data to vectors, obtaining for each sentence the text vectors X = {x_1, x_2, x_3, ..., x_m} and the corresponding part-of-speech vectors P = {p_1, p_2, p_3, ..., p_m}, where m is the sentence length, x_t ∈ R^l is the t-th text vector, and p_t ∈ R^l is the part-of-speech vector of x_t, both of dimension l; concatenating each text vector in the sentence with its part-of-speech vector gives the model input V = {X; P}, V = {v_1, v_2, v_3, ..., v_m}, where v_t ∈ R^{2l} is the t-th input vector, of dimension 2l;

4.2) in the self-attention layer, computing for each unit in the sentence a weight vector over the other components of the sentence, fusing the current input with its weight vector, and re-encoding with an LSTM network to obtain feature vectors that merge sentence-level semantics and part of speech:

The weight of the input vector at time t with respect to the whole input is computed as c_t = att(V, v_t). In this computation, v_t and v_j denote the inputs at times t and j; W, W_v and one further parameter are the parameters of the scoring function; and e_{t,j} denotes the weight between the input at time t and the input at time j. The weight e_{t,i} between the inputs at times t and i is normalized to give α_{t,i}, the normalized weight of the input at time t with respect to the input at time i. With v_i the input at time i, c_t is obtained by summing over all time steps:

c_t = \sum_{i=1}^{m} \alpha_{t,i} v_i

where c_t represents the normalized weight of the input vector at time t with respect to the whole input. The weight and the current input are concatenated into [v_t, c_t], and the original input is re-encoded with an LSTM network to incorporate the weight information:

h_t = LSTM(h_{t-1}, [v_t, c_t])

where h_{t-1} is the output at the previous time step, v_t is the input at time t, c_t is the weight of the input at time t with respect to the whole input, h_t ∈ R^k is the re-encoded output at each time step, and k is the dimension of the LSTM hidden layer. The output of the self-attention layer is therefore H = {h_1, h_2, h_3, ..., h_m}, where m is the output sequence length;
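The attention computation of step 4.2) can be illustrated with a small sketch. Two parts of it are assumptions rather than the patent's definition: the score between two inputs is a plain dot product standing in for the parameterized scoring function, and the normalization is a softmax. The name `attention_context` is hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def attention_context(V: List[Vector], t: int) -> Tuple[List[float], Vector]:
    """Return (alpha_t, c_t) for time step t."""
    # ASSUMPTION: a dot product stands in for the parameterized score.
    scores = [sum(a * b for a, b in zip(V[t], v_j)) for v_j in V]
    # ASSUMPTION: softmax normalization (shifted by the max for stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # c_t is the alpha-weighted sum of all inputs v_i.
    c_t = [sum(alphas[i] * V[i][d] for i in range(len(V)))
           for d in range(len(V[t]))]
    return alphas, c_t


V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alphas, c_0 = attention_context(V, 0)
lstm_input = V[0] + c_0  # the concatenation [v_t, c_t] fed to the LSTM
```

The concatenated vector has twice the dimension of v_t; in a full model it would be passed to the LSTM cell that produces h_t.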

4.3) extracting text context features and part-of-speech context features with a bidirectional LSTM neural network:

Q = BiLSTM(H)

The output of the BiLSTM layer is Q = {q_1, q_2, q_3, ..., q_m}, where m is the output sequence length, q_t ∈ R^{2k} is the output of the BiLSTM network at each time step, and k is the dimension of the LSTM hidden layer; because the network is bidirectional, the output vectors have dimension 2k;

4.4) applying a linear transformation to the BiLSTM output to obtain the emission probability matrix, which is input to the CRF layer; using the label transition probability matrix learned by the CRF layer, the optimal label sequence for the input sequence is computed, and the most probable sequence is taken as the final output sequence of named entity category labels:

The output sequence Q of step 4.3) is passed through a linear transformation and input to the CRF layer:

P = Q W_p + b_p

where W_p ∈ R^{2k×n} and b_p ∈ R^n are parameters to be learned, and the linear transformation gives P ∈ R^{m×n}; k is the dimension of the BiLSTM hidden layer, m is the length of the input sequence, and n is the number of entity labels. P is the emission probability matrix of the CRF, whose element P_{i,j} is the probability that the i-th input is labeled with the j-th entity label. The label transition matrix A ∈ R^{n×n} is a parameter matrix learned during model training, whose element A_{i,j} is the probability of a transition from the i-th entity label to the j-th entity label. From these two matrices, the probability of the optimal label sequence y given the input sequence V is computed as follows:

s(V, y) = \sum_{i=1}^{m-1} A_{y_i, y_{i+1}} + \sum_{i=1}^{m} P_{i, y_i}

p(y \mid V) = \frac{\exp(s(V, y))}{\sum_{\tilde{y} \in Y} \exp(s(V, \tilde{y}))}

where V is the input sequence; y is the optimal label sequence, that is, the true label sequence of the current input sequence; m is the length of the input; A_{y_i, y_{i+1}} is the probability of a transition from label y_i to label y_{i+1}; P_{i, y_i} is the probability that the i-th input unit is labeled y_i; and s(V, y) is the score of label sequence y. Y is the set of all label sequences; the score s(V, ỹ) is computed for each label sequence ỹ in Y, and these are summed into the total score of all possible label sequences, which yields the normalized score p(y | V) of the optimal label sequence y. The negative logarithm of the predicted probability is taken as the loss function of the model, and training yields the named entity prediction model. The loss function is:

L = -\log p(y \mid V).
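To make the loss concrete, the sketch below evaluates the sequence score and the negative log-likelihood by enumerating every label sequence. It assumes the standard linear-chain CRF form, in which scores are exponentiated before normalization; the function names are hypothetical. Enumeration is exponential in the sentence length and is only suitable for tiny examples; practical implementations compute the normalizer with the forward algorithm.

```python
import itertools
import math
from typing import List, Sequence

Matrix = List[List[float]]


def sequence_score(P: Matrix, A: Matrix, y: Sequence[int]) -> float:
    """s(V, y): emission scores plus transition scores along y."""
    emission = sum(P[i][y[i]] for i in range(len(y)))
    transition = sum(A[y[i]][y[i + 1]] for i in range(len(y) - 1))
    return emission + transition


def neg_log_likelihood(P: Matrix, A: Matrix, y: Sequence[int]) -> float:
    """L = -log p(y | V), with the normalizer taken over all sequences."""
    m, n = len(P), len(P[0])
    scores = [sequence_score(P, A, cand)
              for cand in itertools.product(range(n), repeat=m)]
    # Log-sum-exp, shifted by the maximum for numerical stability.
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_z - sequence_score(P, A, y)


# Toy example: m = 2 inputs, n = 2 labels (values are illustrative).
P = [[2.0, 0.0], [0.0, 1.0]]
A = [[0.0, 0.5], [0.0, 0.0]]
loss = neg_log_likelihood(P, A, [0, 1])
```

When every entry of P and A is zero, all label sequences are equally likely and the loss reduces to the logarithm of the number of sequences.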

In step 5), predicting named entity labels comprises the following steps:

5.1) performing part-of-speech tagging, with a Chinese part-of-speech tagging tool, on the Chinese electronic health record data from which entities are to be extracted; according to the dictionary obtained in step 1), judging whether each tagged unit appears in the common-word dictionary; if so, the unit is a common word and its part of speech is retained; if not, the unit may contain part of an entity word, so it is split into a character sequence and each character is tagged with the part of speech "s";

5.2) according to the mapping tables obtained in step 3), mapping the text and parts of speech of the tagging result of step 5.1) to vectors, obtaining for each sentence the text vectors X = {x_1, x_2, x_3, ..., x_m} and the corresponding part-of-speech vectors P = {p_1, p_2, p_3, ..., p_m}, where m is the sentence length, x_t is the t-th text unit, and p_t is the part of speech of x_t; concatenating each text vector in the sentence with its part-of-speech vector gives the model input V = {X; P}, V = {v_1, v_2, v_3, ..., v_m}, where v_t is the t-th input vector and m is the input length;

5.3) inputting the vectors of step 5.2) into the prediction model obtained in step 4) to predict entity labels for the Chinese electronic health record data from which named entities are to be extracted, and taking the most probable predicted sequence as the final labeling result:

y^* = \arg\max_{y \in Y} p(y \mid V)

where Y is the set of all possible label sequences; for each label sequence y in Y, the normalized score p(y | V) under the current input V is computed, and y^* is the label sequence with the highest score.
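The text does not name the algorithm used to find the highest-scoring label sequence. A common choice for linear-chain CRFs is Viterbi decoding, sketched below as an assumption rather than as the patent's stated procedure. Because p(y | V) increases with the sequence score, maximizing the sum of emission and transition scores gives the same sequence; the function name `viterbi` and the toy matrices are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def viterbi(P: Matrix, A: Matrix) -> Tuple[List[int], float]:
    """Return the label sequence maximizing emission + transition scores."""
    m, n = len(P), len(P[0])
    best = list(P[0])            # best[j]: best score of a path ending in label j
    back: List[List[int]] = []   # back-pointers for positions 1..m-1
    for i in range(1, m):
        pointers, scores = [], []
        for j in range(n):
            k = max(range(n), key=lambda prev: best[prev] + A[prev][j])
            pointers.append(k)
            scores.append(best[k] + A[k][j] + P[i][j])
        back.append(pointers)
        best = scores
    last = max(range(n), key=lambda j: best[j])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


# Toy example: m = 3 inputs, n = 2 labels (values are illustrative).
P = [[1.0, 0.0], [0.0, 0.5], [0.2, 0.1]]
A = [[0.0, -2.0], [0.3, 0.0]]
path, best_score = viterbi(P, A)
```

The running time is proportional to m·n², compared with nᵐ for scoring every sequence in Y.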

Compared with the prior art, the present invention has the following advantages and beneficial effects:

1. Part-of-speech features are added to the deep learning model for named entities, enriching the grammatical features of the input.

2. The simplified part-of-speech tagging method reduces segmentation and part-of-speech tagging errors on medical entities.

3. No medical dictionary is required, which reduces the work of building a domain dictionary.

4. A self-attention mechanism is added to the model, improving its ability to handle long-range dependencies.

Brief description of the drawings

Fig. 1 is a flow chart of the method of the present invention.

Fig. 2 is a flow chart of the construction of the common-word dictionary for electronic health records.

Fig. 3 is a flow chart of simplified part-of-speech tagging.

Fig. 4 is a diagram of the deep learning model that fuses part of speech with the self-attention mechanism.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the embodiment and the accompanying drawings, but embodiments of the present invention are not limited thereto.

As shown in Figs. 1 to 4, the named entity recognition method for Chinese electronic health records provided by this embodiment mainly fuses part-of-speech features with a self-attention mechanism. In the data preprocessing stage, the simplified entity part-of-speech tagging method yields the word vectors and part-of-speech tags of common words, and the character vectors and substitute part-of-speech tags of named entity words. The text vectors and the corresponding part-of-speech vectors are concatenated and input into the model that fuses part of speech with self-attention. The self-attention layer computes the weight vector of each time step's input vector relative to the whole sentence, giving sentence-level semantic and part-of-speech features. These are input into a bidirectional LSTM network, which produces the text context features and part-of-speech context features of each input. Finally, the CRF layer produces the labeling result.

The specific steps of the present invention are as follows:

Step 1, constructing the common-word dictionary

1.1) segmenting the labeled data with a word segmentation tool;

For example, the segmentation result of "in our hospital perform <entity>total hysterectomy</entity>" is the five units "in / our hospital / perform+total / uterus / resection", in which the third unit (行全) joins the verb "perform" (行) to the first character, "total" (全), of the entity.

1.2) judging whether each segmentation unit falls within the span of a named entity; if so, the unit is a named entity word or contains part of a named entity, and is not processed; if not, the unit is a common word and is added to the dictionary, yielding the common-word dictionary.

In the example above, "in" and "our hospital" contain no part of a named entity and are added to the common-word dictionary. The unit "perform+total" (行全) contains part of a named entity, so it is classed as a named entity word and is not added to the common-word dictionary; likewise, "uterus" (子宫) and "resection" (切除术) are not common words.

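Step 1 can be sketched as an overlap test between segmentation units and annotated entity spans. The function name `build_common_word_dict` and the character-offset representation of entities are hypothetical. The Chinese strings are an assumed reconstruction of the translated example sentence, in particular the first two units.

```python
from typing import List, Set, Tuple


def build_common_word_dict(tokens: List[str],
                           entity_spans: List[Tuple[int, int]]) -> Set[str]:
    """Collect segmentation units that do not overlap any annotated entity.

    tokens: segmenter output for one sentence, in order.
    entity_spans: half-open character offsets [start, end) of the entities.
    """
    common: Set[str] = set()
    pos = 0
    for tok in tokens:
        start, end = pos, pos + len(tok)
        pos = end
        overlaps = any(start < e_end and e_start < end
                       for e_start, e_end in entity_spans)
        if not overlaps:
            common.add(tok)  # the unit lies entirely outside every entity
    return common


# ASSUMPTION: reconstructed form of the example; the entity 全子宫切除术
# occupies characters 4..9 of the sentence.
tokens = ["在", "我院", "行全", "子宫", "切除术"]
entity_spans = [(4, 10)]
common_words = build_common_word_dict(tokens, entity_spans)
```

Only the first two units survive, which matches the outcome described for the example; in practice the dictionary would be accumulated over all labeled sentences.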
Step 2, simplified part-of-speech tagging: according to the common-word dictionary constructed in step 1, the part-of-speech tags of common words are retained and the part-of-speech tags of suspected named entity words are removed, as follows:

2.1) performing part-of-speech tagging on all the data with a Chinese part-of-speech tagging tool such as jieba. The example above is tagged as:

"in_p our hospital_n perform+total_n uterus_n resection_l", where "p, n, l" are parts of speech denoting preposition, noun, and idiom respectively.

2.2) judging whether each tagged unit appears in the common-word dictionary; if so, the unit is a common word and its part of speech is retained; if not, the unit may contain part of a named entity, so it is split into a character sequence and each character is tagged with the part of speech "s". The tagging result is:

"in_p our hospital_n 行_s 全_s 子_s 宫_s 切_s 除_s 术_s"

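The rule of step 2.2) can be sketched as a single pass over the tagger output. The function name `simplify_pos` is hypothetical, and the input pairs are written by hand rather than produced by an actual tagger call. The Chinese strings are an assumed reconstruction of the translated example.

```python
from typing import List, Set, Tuple

ENTITY_POS = "s"  # substitute tag given to characters of suspected entities


def simplify_pos(tagged: List[Tuple[str, str]],
                 common_words: Set[str]) -> List[Tuple[str, str]]:
    """Keep tags of common words; split other units into characters tagged 's'."""
    result: List[Tuple[str, str]] = []
    for unit, pos in tagged:
        if unit in common_words:
            result.append((unit, pos))
        else:
            result.extend((ch, ENTITY_POS) for ch in unit)
    return result


# ASSUMPTION: hand-written (unit, tag) pairs mirroring the tagged example.
tagged = [("在", "p"), ("我院", "n"), ("行全", "n"), ("子宫", "n"), ("切除术", "l")]
simplified = simplify_pos(tagged, {"在", "我院"})
```

The same function applies unchanged at prediction time in step 5.1), since it needs only the tagger output and the dictionary, not the entity annotations.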
Step 3, constructing the text and part-of-speech vector mapping tables: the part-of-speech tagging result of step 2 is separated into two files. One is the text sequence, containing word units and character units; the other is the corresponding part-of-speech sequence, containing the parts of speech of common words and the entity part of speech "s" of the units split into characters. The two files are trained separately with the word vector tool word2vec, yielding the text vector mapping table and the part-of-speech vector mapping table;
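Separating the tagging result into the two training corpora of step 3 can be sketched as follows. The names `split_corpora` and `to_lines`, the placeholder units, and the one-sentence-per-line, space-separated file layout are assumptions; the word2vec training itself is not shown.

```python
from typing import List, Tuple

TaggedSentence = List[Tuple[str, str]]


def split_corpora(tagged: List[TaggedSentence]
                  ) -> Tuple[List[List[str]], List[List[str]]]:
    """Split (unit, tag) sentences into parallel text and tag corpora."""
    text_corpus = [[unit for unit, _ in sent] for sent in tagged]
    pos_corpus = [[pos for _, pos in sent] for sent in tagged]
    return text_corpus, pos_corpus


def to_lines(corpus: List[List[str]]) -> List[str]:
    """Render a corpus as one space-separated sentence per line."""
    return [" ".join(sent) for sent in corpus]


# Placeholder units: two common words followed by two entity characters.
tagged = [[("w1", "p"), ("w2", "n"), ("c1", "s"), ("c2", "s")]]
text_corpus, pos_corpus = split_corpora(tagged)
text_lines, pos_lines = to_lines(text_corpus), to_lines(pos_corpus)
```

The two corpora stay aligned position by position, so the t-th text unit and the t-th tag can later be looked up and concatenated as in step 4.1).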

Step 4, training the entity prediction model: using the mapping tables obtained in step 3, the text and parts of speech of the labeled data are mapped to vectors, concatenated, and input into the model that fuses part of speech with self-attention; training yields the entity prediction model, as follows:

4.1) according to the mapping tables obtained in step 3, mapping the text and parts of speech of the labeled data to vectors, obtaining for each sentence the text vectors X = {x_1, x_2, x_3, ..., x_m} and the corresponding part-of-speech vectors P = {p_1, p_2, p_3, ..., p_m}, where m is the sentence length, x_t ∈ R^l is the t-th text vector, and p_t ∈ R^l is the part-of-speech vector of x_t, both of dimension l; concatenating each text vector in the sentence with its part-of-speech vector gives the model input V = {X; P}, V = {v_1, v_2, v_3, ..., v_m}, where v_t ∈ R^{2l} is the t-th input vector, of dimension 2l.

4.2) inputting the vector sequence of each sentence into the model; the self-attention layer computes the weight of the input vector at time t with respect to the whole input, c_t = att(V, v_t). In this computation, v_t and v_j denote the inputs at times t and j; W, W_v and one further parameter are the parameters of the scoring function; and e_{t,j} denotes the weight between the input at time t and the input at time j. The weight e_{t,i} between the inputs at times t and i is normalized to give α_{t,i}, the normalized weight of the input at time t with respect to the input at time i. With v_i the input at time i, c_t is obtained by summing over all time steps:

c_t = \sum_{i=1}^{m} \alpha_{t,i} v_i

where c_t represents the normalized weight of the input vector at time t with respect to the whole input. The weight and the current input are concatenated into [v_t, c_t], and the original input is re-encoded with an LSTM network to incorporate the weight information:

h_t = LSTM(h_{t-1}, [v_t, c_t])

The outputs h_t form the output of the self-attention layer, H = {h_1, h_2, h_3, ..., h_m}.

4.3) extracting text context features and part-of-speech context features with a bidirectional LSTM neural network:

Q = BiLSTM(H)

4.4) passing the output sequence of 4.3) through a linear transformation and inputting it to the CRF layer:

P = Q W_p + b_p

where W_p ∈ R^{2k×n} and b_p ∈ R^n are parameters to be learned, and the linear transformation gives P ∈ R^{m×n}; k is the dimension of the BiLSTM hidden layer, m is the length of the input sequence, and n is the number of entity labels. P is the emission probability matrix of the CRF, whose element P_{i,j} is the probability that the i-th input is labeled with the j-th entity label. The label transition matrix A ∈ R^{n×n} is a parameter matrix learned during model training, whose element A_{i,j} is the probability of a transition from the i-th entity label to the j-th entity label. From these two matrices, the probability of the optimal label sequence y given the input sequence V is computed as follows:

s(V, y) = \sum_{i=1}^{m-1} A_{y_i, y_{i+1}} + \sum_{i=1}^{m} P_{i, y_i}

p(y \mid V) = \frac{\exp(s(V, y))}{\sum_{\tilde{y} \in Y} \exp(s(V, \tilde{y}))}

where V is the input sequence; y is the optimal label sequence, that is, the true label sequence of the current input sequence; m is the length of the input; A_{y_i, y_{i+1}} is the probability of a transition from label y_i to label y_{i+1}; P_{i, y_i} is the probability that the i-th input unit is labeled y_i; and s(V, y) is the score of label sequence y.

Y is the set of all label sequences; the score s(V, ỹ) is computed for each label sequence ỹ in Y, and these are summed into the total score of all possible label sequences, which yields the normalized score p(y | V) of the optimal label sequence y. The negative logarithm of the predicted probability is taken as the loss function of the model, and training yields the named entity prediction model. The loss function is:

L = -\log p(y \mid V)

Step 5, predicting entity labels

5.1) performing part-of-speech tagging, with a part-of-speech tagging tool, on the Chinese electronic health record data from which named entities are to be extracted; according to the dictionary obtained in step 1, judging whether each tagged unit appears in the common-word dictionary; if so, the unit is a common word and its part of speech is retained; if not, the unit may contain part of an entity word, so it is split into a character sequence and each character is tagged with the part of speech "s";

5.2) according to the mapping tables obtained in step 3, mapping the text and parts of speech of the tagging result of 5.1) to vectors, obtaining for each sentence the text vectors X = {x_1, x_2, x_3, ..., x_m} and the corresponding part-of-speech vectors P = {p_1, p_2, p_3, ..., p_m}, where m is the sentence length, x_t is the t-th text unit, and p_t is the part of speech of x_t; concatenating each text vector in the sentence with its part-of-speech vector gives the model input V = {X; P}, V = {v_1, v_2, v_3, ..., v_m}, where v_t is the t-th input vector.

5.3) inputting the vectors of 5.2) into the prediction model obtained in step 4 to predict entity labels for the Chinese electronic health record data from which named entities are to be extracted. The most probable predicted sequence is taken as the final labeling result:

y^* = \arg\max_{y \in Y} p(y \mid V)

where Y is the set of all possible label sequences; for each label sequence y in Y, the normalized score p(y | V) under the current input V is computed, and y^* is the label sequence with the highest score.

The above embodiment is a preferred embodiment of the present invention, but embodiments of the present invention are not limited by it; any other change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the present invention shall be an equivalent replacement and is included within the scope of protection of the present invention.

Claims

1. A named entity recognition method for Chinese electronic health records, characterized by comprising the following steps:

1) constructing a common-word dictionary: segmenting the labeled data into words and constructing a common-word dictionary;

2) simplified part-of-speech tagging: according to the common-word dictionary constructed in step 1), retaining the part-of-speech tags of common words and removing the part-of-speech tags of suspected named entity words;

3) constructing the text and part-of-speech vector mapping tables: using the word vector training tool word2vec to train on the text and on the parts of speech of the labeled data separately, obtaining a text vector mapping table and a part-of-speech vector mapping table;

4) training the named entity prediction model: using the mapping tables obtained in step 3), mapping the text and parts of speech of the labeled data to vectors, concatenating them, and inputting them into the model that fuses part of speech with self-attention; training yields the named entity prediction model;

5) predicting named entity labels: according to the common-word dictionary constructed in step 1), performing simplified part-of-speech tagging on the Chinese electronic health record data from which entities are to be extracted; using the mapping tables obtained in step 3), mapping the text and parts of speech of the data to vectors; and predicting named entity labels with the prediction model obtained in step 4).

2. The named entity recognition method for Chinese electronic health records according to claim 1, characterized in that in step 1), constructing the common-word dictionary comprises the following steps:

1.1) segmenting the labeled data with a Chinese word segmentation tool;

1.2) judging whether each segmentation unit falls within the span of a named entity; if so, the unit is part of a named entity word or contains part of an entity word, and is not processed; if not, the unit is a common word and is added to the dictionary, yielding the common-word dictionary.

3. The named entity recognition method for Chinese electronic health records according to claim 1, characterized in that in step 2), simplified part-of-speech tagging comprises the following steps:

2.1) performing part-of-speech tagging on the labeled data with a Chinese part-of-speech tagging tool;

2.2) judging whether each tagged unit appears in the common-word dictionary; if so, the unit is a common word and its part of speech is retained; if not, the unit may contain part of a named entity word, so it is split into a character sequence to avoid segmentation errors, and each character is tagged with the part of speech "s" to reduce part-of-speech tagging errors on named entities;

2.3) obtaining the part-of-speech tagging result of the labeled data.

4. The named entity recognition method for Chinese electronic health records according to claim 1, characterized in that in step 3), the text and part-of-speech vector mapping tables are constructed as follows: the part-of-speech tagging result of step 2) is separated into two files; one is the text sequence, containing word units and character units; the other is the corresponding part-of-speech sequence, containing the parts of speech of common words and the entity part of speech "s" of the units split into characters; the two files are trained separately with the word vector tool word2vec, yielding the text vector mapping table and the part-of-speech vector mapping table.

5. a kind of Chinese electronic health record according to claim 1 names entity recognition method, it is characterised in that: in step 4) In, the prediction model of training name entity, comprising the following steps:

4.1) text for having labeled data and part of speech are mapped to vector, obtain each sentence by the mapping table obtained according to step 3) The text vector of son: X={ x₁,x₂,x₃,...,x_mAnd corresponding part of speech vector: P={ p₁,p₂,p₃,...,p_m, wherein m is sentence Sub- length, x_t∈R^lIndicate t-th of text vector, vector dimension l；p_t∈R^lIndicate x_tPart of speech vector, vector dimension l； The corresponding part of speech vector of text vector in each sentence is spliced, the input vector of model: V={ X is obtained；P}, V={ v₁,v₂,v₃,...,v_m},v_t∈R^2lIndicate t-th of input vector, vector dimension 2l；

4.2) In the self-attention layer, compute for each unit in the sentence its weight vector over the other components of the sentence, fuse the current input with the corresponding weight vector, and re-encode the result with an LSTM network to obtain feature vectors that combine sentence semantics and part-of-speech features:

where $v_t$ and $v_j$ denote the inputs at time steps $t$ and $j$, and $W$, $W_v$ and a further parameter are the parameters of the scoring function, which gives the weight between the input at time step $t$ and the input at time step $j$. The weight between the inputs at time steps $t$ and $i$ is normalized to give the normalized weight of the input at time step $t$ with respect to the input at time step $i$; $v_i$ denotes the input at time step $i$. Summing over all time steps yields $c_t$, the normalized weight of the input vector at time step $t$ over the entire input. The weight and the current input are concatenated as $[v_t,c_t]$, and the original input is re-encoded with an LSTM network so that the weight information is incorporated:

$$h_t=\mathrm{LSTM}(h_{t-1},[v_t,c_t])$$

where $h_{t-1}$ is the output of the previous time step, $v_t$ is the input at time step $t$, $c_t$ is the weight of the input at time step $t$ over the entire input, $h_t\in\mathbb{R}^{k}$ is the re-encoded output at each time step, and $k$ is the hidden-layer dimension of the LSTM network. The output of the self-attention layer is therefore $H=\{h_1,h_2,h_3,\ldots,h_m\}$, where $m$ is the output sequence length;
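The attention part of step 4.2) can be sketched as below. The exact scoring formula is not reproduced in this text, so the sketch assumes an additive score $e_{t,j}=w^\top\tanh(W_v v_t+W_u v_j)$, softmax normalization over all time steps, and $c_t=\sum_i a_{t,i}v_i$; the names `w`, `W_u`, `additive_score` and `attention_inputs` are hypothetical. The LSTM re-encoding of $[v_t,c_t]$ is not shown.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(M: Matrix, v: Vector) -> Vector:
    """Matrix-vector product on plain lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def additive_score(w: Vector, W_v: Matrix, W_u: Matrix, v_t: Vector, v_j: Vector) -> float:
    """Assumed scoring function: w . tanh(W_v v_t + W_u v_j)."""
    hidden = [math.tanh(a + b) for a, b in zip(matvec(W_v, v_t), matvec(W_u, v_j))]
    return sum(w_i * h_i for w_i, h_i in zip(w, hidden))


def softmax(scores: Vector) -> Vector:
    """Normalize scores so they are positive and sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_inputs(V: Matrix, w: Vector, W_v: Matrix, W_u: Matrix) -> Matrix:
    """Return [v_t ; c_t] for every time step t."""
    fused = []
    for v_t in V:
        weights = softmax([additive_score(w, W_v, W_u, v_t, v_j) for v_j in V])
        # c_t is the weighted sum of all inputs under the normalized weights.
        c_t = [sum(a * v_j[d] for a, v_j in zip(weights, V)) for d in range(len(v_t))]
        fused.append(list(v_t) + c_t)
    return fused


# Toy inputs and parameters (illustrative values only).
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_v = [[0.5, 0.0], [0.0, 0.5]]
W_u = [[0.5, 0.0], [0.0, 0.5]]
w = [1.0, 1.0]
fused = attention_inputs(V, w, W_v, W_u)
```

Each row of `fused` has twice the dimension of $v_t$, which is the input size the re-encoding LSTM would need to accept under these assumptions.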

4.3) The output $H$ of the self-attention layer is fed into the BiLSTM layer:

$$Q=\mathrm{BiLSTM}(H)$$

The output of the BiLSTM layer is $Q=\{q_1,q_2,q_3,\ldots,q_m\}$, where $m$ is the output sequence length, $q_t\in\mathbb{R}^{2k}$ is the output of the BiLSTM network at each time step, and $k$ is the hidden-layer dimension of the LSTM network; because the network is a bidirectional LSTM, the output vector dimension is $2k$;

4.4) Apply a linear transformation to the output of the BiLSTM layer to obtain the emission probability matrix, and feed it to the CRF layer. Using the label transition probability matrix learned by the CRF layer, compute the optimal label sequence for the input sequence, and take the sequence with the highest probability as the final named entity label sequence:

$$P=QW_p+b_p$$

where $W_p\in\mathbb{R}^{2k\times n}$ and $b_p\in\mathbb{R}^{n}$ are parameters to be learned in the model, and the linear transformation yields $P\in\mathbb{R}^{m\times n}$; $k$ is the hidden-layer dimension of the BiLSTM network, $m$ is the length of the input sequence, and $n$ is the number of entity labels. The matrix $P$ obtained from the linear transformation is the emission probability matrix of the CRF, whose element $P_{i,j}$ is the probability that the $i$-th input is labeled with the $j$-th entity label. The label transition matrix $A\in\mathbb{R}^{n\times n}$ is a parameter matrix learned during model training, whose element $A_{i,j}$ is the probability of a transition from the $i$-th entity label to the $j$-th entity label. From these two probability matrices, the probability of obtaining the optimal label sequence $y$ given the input sequence $V$ is computed as follows:

where $V$ denotes the input sequence; $y$ denotes the optimal label sequence, i.e. the true label sequence of the current input sequence; $m$ is the length of the input; $A_{y_i,y_{i+1}}$ is the probability of a transition from label $y_i$ to label $y_{i+1}$; $P_{i,y_i}$ is the probability that the $i$-th input unit is labeled $y_i$; and $s(V,y)$ is the score of label sequence $y$. $Y$ denotes the set of all label sequences: for each label sequence $\tilde{y}$ in $Y$, its score $s(V,\tilde{y})$ is computed, and summing gives the total score of all possible label sequences, from which the normalized score $p(y\mid V)$ of the optimal label sequence $y$ is obtained. The negative logarithm of the predicted probability is taken as the loss function for training the named entity prediction model:

$$L=-\log p(y\mid V).$$
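The loss can be illustrated with a small pure-Python sketch. It assumes the usual linear-chain CRF form, $s(V,y)=\sum_i P_{i,y_i}+\sum_i A_{y_i,y_{i+1}}$ and $p(y\mid V)=\exp(s(V,y))/\sum_{\tilde{y}\in Y}\exp(s(V,\tilde{y}))$, without start or end transitions, and treats the entries of $P$ and $A$ as unnormalized scores. The function names and the toy matrices are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def log_sum_exp(xs: List[float]) -> float:
    """Numerically stable log(sum(exp(x)))."""
    peak = max(xs)
    return peak + math.log(sum(math.exp(x - peak) for x in xs))


def sequence_score(P: Matrix, A: Matrix, y: List[int]) -> float:
    """s(V, y): emission scores plus transition scores along the label sequence y."""
    emission = sum(P[i][label] for i, label in enumerate(y))
    transition = sum(A[y[i]][y[i + 1]] for i in range(len(y) - 1))
    return emission + transition


def log_partition(P: Matrix, A: Matrix) -> float:
    """Log of the summed exp-scores of all label sequences (forward algorithm)."""
    n = len(P[0])
    alpha = list(P[0])  # alpha[j]: log-sum of scores of prefixes ending in label j
    for i in range(1, len(P)):
        alpha = [P[i][j] + log_sum_exp([alpha[k] + A[k][j] for k in range(n)])
                 for j in range(n)]
    return log_sum_exp(alpha)


def crf_loss(P: Matrix, A: Matrix, y: List[int]) -> float:
    """L = -log p(y | V) for the true label sequence y."""
    return log_partition(P, A) - sequence_score(P, A, y)


# Toy example: m = 3 inputs, n = 2 labels (illustrative values only).
P = [[2.0, 0.5], [0.3, 1.0], [1.5, 0.2]]
A = [[0.5, -1.0], [-0.5, 0.8]]
loss = crf_loss(P, A, [0, 0, 0])
```

The forward recursion avoids enumerating all $n^m$ label sequences while giving the same normalizer as a direct sum.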

6. The Chinese electronic health record named entity recognition method according to claim 1, characterized in that: in step 5), predicting the named entity labels comprises the following steps:

5.1) Use a Chinese part-of-speech tagging tool to tag the Chinese electronic health record data from which entities are to be extracted. Using the dictionary obtained in step 1), check whether each tagged unit appears in the popular word dictionary. If it does, the unit is a popular word and keeps its part-of-speech tag. If it does not, the unit may contain part of an entity, so it is split into a character sequence and each character is tagged with the part of speech "s";

5.2) text of step 5.1) part-of-speech tagging result and part of speech are mapped to vector by the mapping table obtained according to step 3), Obtain the text vector of each sentence: X={ x₁,x₂,x₃,...,x_mAnd corresponding part of speech vector: P={ p₁,p₂,p₃,..., p_m, wherein m is sentence length, x_tIndicate t text unit, p_tIndicate xth_tPart of speech；By the text vector in each sentence, Corresponding part of speech vector is spliced, and the input vector of model: V={ X is obtained；P }, V={ v₁,v₂,v₃,...,v_m},v_t Indicate t input vector；

5.3) vector of step 5.2) is input to prediction model obtained in step 4), to the middle message of name entity to be extracted Sub- medical record data carries out entity tag prediction；Take the forecasting sequence of maximum probability as final annotation results:Wherein Y indicates the set of all sequence labels, and to each sequence label y in Y, calculating is being worked as Under preceding input V, the normalization score p (y | V) of sequence label y, y are obtained^*Indicate the highest sequence label of score.