CN109002436A

CN109002436A - Method and system for automatic recognition of medical text terms based on a long short-term memory network

Info

Publication number: CN109002436A
Application number: CN201810762297.3A
Authority: CN
Inventors: 赵孟海; 严志华
Original assignee: Shanghai Jinshida Weining Software Technology Co Ltd
Current assignee: Shanghai Jinshida Weining Software Technology Co Ltd
Priority date: 2018-07-12
Filing date: 2018-07-12
Publication date: 2018-12-14

Abstract

The invention discloses a method and system for automatic recognition of medical text terms based on a long short-term memory (LSTM) network, designed to extract medical-term entities from medical text automatically. The method comprises: representing each character in a medical text sentence with a pre-trained character vector to obtain training data; feeding the training data into a bidirectional long short-term memory network to obtain the highest-probability label category for each character in the sentence; and feeding this output into a conditional random field (CRF), where the Viterbi algorithm computes the label sequence with the highest joint probability. The invention combines the respective strengths of the bidirectional long short-term memory network and the conditional random field and can effectively improve the accuracy of character labeling.

Description

Method and system for automatic recognition of medical text terms based on a long short-term memory network

Technical field

The present invention relates to the field of machine learning, and in particular to a method and system for automatic recognition of medical text terms based on a long short-term memory network.

Background technique

Traditional medical term recognition systems fall into two groups: systems based on dictionary matching and systems based on machine learning.

Dictionary-matching systems offer high precision and fast recognition, but they place heavy demands on the size and quality of the medical dictionary and cannot recognize terms that are missing from the dictionary, so their recall is often insufficient.

Systems based on conventional machine learning methods learn the contextual information of medical terms from training data and use that context to recognize terms. This avoids the failure of dictionary matching on out-of-dictionary terms and greatly increases recall, but precision is often lower.

In view of the above, the designers have pursued research and innovation to create a method and system for automatic recognition of medical text terms based on a long short-term memory network, giving it greater practical value in industry.

Summary of the invention

To solve the above technical problems, the object of the present invention is to provide a high-precision, high-recall method and system for automatic recognition of medical text terms based on a long short-term memory network.

The method of the present invention for automatic recognition of medical text terms based on a long short-term memory network comprises:

representing each character in a medical text sentence with a pre-trained character vector to obtain training data;

feeding the training data into a bidirectional long short-term memory network to obtain the highest-probability label category for each character in the medical text sentence;

feeding this output, the highest-probability label category of each character, into a conditional random field, and using the Viterbi algorithm to compute the label sequence with the highest joint probability.

Further, the character vectors are obtained with the word2vec vector training method. The resulting character vector matrix L is an n × m matrix, where n is the number of characters in the dictionary and m is the dimension of each character vector; m usually takes a value between 100 and 300.

The system of the present invention for automatic recognition of medical text terms based on a long short-term memory network comprises:

a character vector model unit, for representing each character in a medical text sentence with a pre-trained character vector;

a bidirectional long short-term memory network unit, for feeding the training data into the bidirectional long short-term memory network to obtain the highest-probability label category for each character in the medical text sentence;

a conditional random field model unit, for feeding this output, the highest-probability label category of each character, into a conditional random field and using the Viterbi algorithm to compute the label sequence with the highest joint probability.

Further, the system specifically comprises:

a text input layer, in which text is input split into single characters;

a character vector embedding layer, which maps each input character to its pre-trained character vector through the matrix L;

a bidirectional long short-term memory network layer, which extracts features from the character vector embedding layer using a forward LSTM layer and a backward LSTM layer respectively;

a conditional random field layer, which takes the integrated information of the bidirectional LSTM as input and outputs the medical text labeled character by character with its part of speech.

According to the above scheme, the method and system of the present invention for automatic recognition of medical text terms based on a long short-term memory network have at least the following advantages:

The present invention uses a bidirectional long short-term memory network, takes the distributed representation of each character in the medical text as the network input, and outputs the highest-probability label category for each character. The bidirectional long short-term memory network takes full account of the contextual information of the text and focuses on maximizing the probability of each character's label category. The conditional random field instead considers a linear weighted combination of the local features of the whole sentence, computes the joint probability, and optimizes the entire sequence directly. The bidirectional long short-term memory network and the conditional random field algorithm label the character sequence jointly. Compared with traditional algorithms, this combines the respective strengths of the two: the bidirectional long short-term memory network makes fuller use of contextual information and effectively improves the accuracy of character labeling, so that the accuracy of character classification in sequence labeling is greatly improved and, in turn, the precision and recall of the medical term automatic recognition system are improved.

The above is only an overview of the technical scheme of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the specification, preferred embodiments of the invention are described in detail below with reference to the accompanying drawings.

Brief description of the drawings

Fig. 1 is a framework diagram of the bidirectional long short-term memory network of the method and system of the present invention for automatic recognition of medical text terms based on a long short-term memory network;

Fig. 2 is a detailed structural diagram of the long short-term memory network units FL₁-FL₅ and BL₁-BL₅ in the method and system of the present invention for automatic recognition of medical text terms based on a long short-term memory network.

Specific embodiment

Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following embodiments illustrate the present invention and are not intended to limit its scope.

Medical texts such as textbooks, clinical guidelines, and electronic health records all contain a large number of specialized medical terms, and these terms play an important role in text structuring, knowledge extraction, and similar tasks. We divide key medical terms into 23 vocabulary categories, including symptom (SYM), disease (DIS), sign (SGN), body region (REG), organ (ORG), body fluid (BFL), examination (TES), drug (DRU), and surgery (SUR). This scheme converts the automatic recognition of medical terms into a character sequence labeling problem over medical text: the character sequence is the observation sequence, and the sequence formed by the term category of each character is the state sequence.
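As a rough illustration of how the state space could be built, the sketch below derives a label set from the category codes named above. It assumes the BIO scheme used in Embodiment 1 and lowercase label suffixes such as `B_dis`; only the nine codes listed in the text are included, since the remaining categories are not enumerated. The names `CATEGORIES`, `build_label_set`, `LABELS`, and `LABEL_TO_ID` are hypothetical.

```python
# Category codes named in the description. The text mentions 23 categories
# but lists only these nine, so the rest are deliberately left out.
CATEGORIES = ["SYM", "DIS", "SGN", "REG", "ORG", "BFL", "TES", "DRU", "SUR"]


def build_label_set(categories):
    """Return the BIO label list: 'O' plus one B_ and one I_ label per category."""
    labels = ["O"]
    for code in categories:
        tag = code.lower()  # assumption: lowercase suffixes, as in 'B_dis'
        labels.append("B_" + tag)
        labels.append("I_" + tag)
    return labels


LABELS = build_label_set(CATEGORIES)
# Integer ids are what a network output layer would typically index.
LABEL_TO_ID = {label: i for i, label in enumerate(LABELS)}
```

Under this construction each category contributes two labels, so k categories give 2k + 1 states in total.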

Embodiment 1

A preferred embodiment of the method of the present invention for automatic recognition of medical text terms based on a long short-term memory network comprises:

This embodiment uses a long short-term memory network and a conditional random field, which require a large amount of labeled training data. In the manual annotation of the character sequences of medical text, the common BIO scheme is used to label characters.

For example, the sentence '糖尿病的症状有多饮、多食和多尿。' ('The symptoms of diabetes include polydipsia, polyphagia, and polyuria.') is labeled character by character as follows:

糖 B_dis

尿 I_dis

病 I_dis

的 O

症 O

状 O

有 O

多 B_sym

饮 I_sym

、 O

多 B_sym

食 I_sym

和 O

多 B_sym

尿 I_sym

。 O

In the above annotation, disease (dis) and symptom (sym) entities are marked with dedicated labels such as 'B_dis', while all other characters and symbols are labeled 'O'.
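The annotation above can be turned back into term spans with a short decoding pass. The sketch below is one possible way to do this, not part of the patent: it assumes the example sentence is the Chinese original 糖尿病的症状有多饮、多食和多尿。 and that `B_` opens an entity while matching `I_` labels extend it. The function name `bio_to_entities` is hypothetical.

```python
def bio_to_entities(chars, labels):
    """Group B_/I_ labels into (text, category, start, end) tuples; end is exclusive."""
    entities = []
    start, category = None, None
    for i, label in enumerate(labels):
        # An I_ label of the same category extends the currently open entity.
        if label.startswith("I_") and category == label[2:]:
            continue
        # Any other label closes the entity that is currently open.
        if category is not None:
            entities.append(("".join(chars[start:i]), category, start, i))
            start, category = None, None
        if label.startswith("B_"):
            start, category = i, label[2:]
        # 'O' and stray I_ labels (no matching B_ before them) open nothing.
    if category is not None:
        entities.append(("".join(chars[start:]), category, start, len(chars)))
    return entities


# Assumed Chinese original of the worked example, split into single characters.
sentence = list("糖尿病的症状有多饮、多食和多尿。")
tags = ["B_dis", "I_dis", "I_dis", "O", "O", "O", "O",
        "B_sym", "I_sym", "O", "B_sym", "I_sym", "O", "B_sym", "I_sym", "O"]
entities = bio_to_entities(sentence, tags)
```

Here `entities` holds one disease span (糖尿病) and three symptom spans (多饮, 多食, 多尿), each with its character offsets.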

Embodiment 2

A preferred embodiment of the system of the present invention for automatic recognition of medical text terms based on a long short-term memory network comprises:

As shown in Figs. 1 and 2, a bidirectional long short-term memory network + conditional random field model is built as follows:

Model programming framework: Python, TensorFlow

Model training data: the large amount of labeled medical text from step 1

Model input: medical text, fed to the model character by character

Model output: the medical text labeled character by character with its part of speech

Model framework: a text input layer, a character vector embedding layer, a bidirectional long short-term memory network layer, a conditional random field layer, and an output layer, arranged from bottom to top as shown in the structure diagram:

1. The bottom of the model is the Chinese character input layer; text is fed to the model split into single characters.

2. E₁-E₅ are the character vector embedding layer, which maps each input character through the matrix L to the character vector pre-trained in step 2.

3. FL₁-FL₅ are the forward LSTM layer, which extracts features from E₁-E₅.

4. BL₁-BL₅ are the backward LSTM layer, which extracts features from E₁-E₅.

5. O₁-O₅ are the output layer that integrates the information of the bidirectional LSTM and serves as the input of the subsequent CRF layer.

6. C₁-C₅ are the CRF layer.

7. The top is the final output layer of the model, which predicts the label of each input-layer character.
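To make the data flow through layers 3 to 5 concrete, the sketch below runs a recurrent step over the embeddings in both directions and concatenates the two results at each position. It is a minimal illustration under stated assumptions: the step function is a running sum standing in for an LSTM unit, the embeddings are one-dimensional toy values, and the names `run_recurrent`, `bidirectional`, and `running_sum` are hypothetical.

```python
def run_recurrent(step, inputs, initial_state):
    """Apply step(state, x) -> state over inputs and return the state at each position."""
    states, state = [], initial_state
    for x in inputs:
        state = step(state, x)
        states.append(state)
    return states


def bidirectional(step_fw, step_bw, inputs, initial_state):
    """Concatenate forward and backward states position by position."""
    forward = run_recurrent(step_fw, inputs, initial_state)
    # The backward pass reads the sequence right to left; its states are then
    # reversed again so that index t lines up with input position t.
    backward = run_recurrent(step_bw, inputs[::-1], initial_state)[::-1]
    return [f + b for f, b in zip(forward, backward)]


def running_sum(state, x):
    """Stand-in recurrent step (NOT an LSTM): accumulates the inputs seen so far."""
    return [s + v for s, v in zip(state, x)]


embeddings = [[1.0], [2.0], [3.0], [4.0], [5.0]]  # toy E1..E5, one dimension each
outputs = bidirectional(running_sum, running_sum, embeddings, [0.0])
```

At the first position the forward half has seen only E₁ while the backward half has seen E₁ through E₅, which is how each integrated output carries context from both sides of the character.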

Detailed structure of the long short-term memory network units FL₁-FL₅ and BL₁-BL₅ in the model:

In the LSTM unit structure diagram, x_t is the model input at time t and h_t is the model output at time t. Because the LSTM is a recurrent neural network, h_t also becomes an input at the next time step t+1; that is, the unit at time t+1 receives the input [h_t, x_{t+1}]. c_t is the unit state, used to store long-term state. σ is the sigmoid function and tanh is the hyperbolic tangent function. W_f is the forget gate weight matrix, W_i is the input gate weight matrix, W_o is the output gate weight matrix, and W_c is the weight matrix for the new information added to the current unit state c_t.
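The unit described above can be sketched as a single update step. This follows the standard LSTM formulation rather than anything spelled out in the patent text, and it omits bias terms because the description names only the four weight matrices; the helper names `sigmoid`, `matvec`, and `lstm_step` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_o, W_c):
    """One LSTM step; each weight matrix acts on the concatenation [h_prev, x_t]."""
    z = h_prev + x_t                                 # list concatenation
    f = [sigmoid(a) for a in matvec(W_f, z)]         # forget gate
    i = [sigmoid(a) for a in matvec(W_i, z)]         # input gate
    o = [sigmoid(a) for a in matvec(W_o, z)]         # output gate
    c_new = [math.tanh(a) for a in matvec(W_c, z)]   # candidate new information
    # Unit state: keep part of the old state and add part of the new information.
    c_t = [f_k * c_k + i_k * n_k
           for f_k, c_k, i_k, n_k in zip(f, c_prev, i, c_new)]
    h_t = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c_t)]
    return h_t, c_t


# Toy check with hidden size 1, input size 1, and all-zero weights: every gate
# is sigmoid(0) = 0.5 and the candidate is tanh(0) = 0, so the state is halved.
zero = [[0.0, 0.0]]
h_t, c_t = lstm_step([1.0], [0.0], [2.0], zero, zero, zero, zero)
```

With zero weights the step simply halves the previous unit state, which makes the role of the forget gate easy to see; trained weights would make each gate depend on [h_prev, x_t].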

In the above embodiments, a character vector reflects the position of a character in semantic space, and the cosine distance in that space indicates the semantic similarity between the corresponding characters. This scheme uses the word2vec vector training method and trains character vectors on a large body of medical text (medical textbooks, clinical guidelines, electronic health records, etc.), producing high-dimensional character vectors that reflect the relative positions of characters in the semantic vector space. The final character vector matrix L is an n × m matrix, where n is the number of characters in the dictionary and m is the dimension of each character vector; m usually takes a value between 100 and 300.
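The lookup through the matrix L and the cosine comparison can be illustrated with a toy example. The sketch below does not train anything with word2vec: the matrix values are made up, n = 3 and m = 4 are far smaller than the 100 to 300 dimensions mentioned above, and the names `vocab`, `embed`, and `cosine_similarity` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy character vector matrix L with n = 3 rows (characters) and m = 4 columns.
# The values are invented for illustration, not learned from medical text.
vocab = {"糖": 0, "尿": 1, "病": 2}
L = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
]


def embed(sentence):
    """Map each character of the sentence to its row of L (unknown characters not handled)."""
    return [L[vocab[ch]] for ch in sentence]


vectors = embed("糖尿病")
sim = cosine_similarity(L[vocab["糖"]], L[vocab["病"]])
```

In a real system the rows of L would come from word2vec training on the medical corpus, and a fallback row for characters outside the dictionary would be needed.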

In a character labeling task based on the long short-term memory network alone, the output layer simply concatenates the highest-probability label category of each character. This ignores the sentence as a whole and often produces many wrong or unreasonable labels. To correct them, this scheme uses a conditional random field to further compute the joint probability of the label sequence and uses the Viterbi algorithm to find the state sequence with the highest joint probability. The conditional random field optimizes the entire sequence, which is the ultimate goal of the labeling task, so it can correct some of the wrong or unreasonable labels output by the long short-term memory network and improve the accuracy of automatic term recognition. Finally, the above system achieved 89% accuracy on the validation data set.
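The correction step can be sketched with a small Viterbi decoder. This is an illustration under assumptions rather than the patented implementation: scores are treated as log-scores so that sums correspond to products of probabilities, the transition scores are set by hand (a trained CRF would learn them), and the emission values are invented. The function name `viterbi` and the variables below are hypothetical.

```python
def viterbi(emissions, transitions, labels):
    """Return the label sequence maximizing the sum of emission and transition scores.

    emissions[t][j]   : score of label j at position t (e.g. from the BiLSTM)
    transitions[i][j] : score of moving from label i to label j
    """
    n = len(labels)
    score = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n):
            # Best previous label for ending in label j at position t.
            best_i = max(range(n), key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final label.
    path = [max(range(n), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return [labels[j] for j in path]


labels = ["O", "B_sym", "I_sym"]
# Hand-set transition scores: 'O' followed directly by 'I_sym' is penalized.
transitions = [[0.0, 0.0, -100.0],
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]
# Invented per-character scores; position 1 slightly prefers 'I_sym'.
emissions = [[2.0, 0.0, 0.0],
             [0.0, 1.0, 1.5],
             [0.0, 0.0, 2.0]]

greedy = [labels[max(range(3), key=lambda j: row[j])] for row in emissions]
decoded = viterbi(emissions, transitions, labels)
```

Taking the best label per character gives `O, I_sym, I_sym`, which starts an entity with an inside label. Decoding over the whole sequence gives `O, B_sym, I_sym` instead, the kind of unreasonable label the paragraph above says the conditional random field is meant to repair.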

The above is only a preferred embodiment of the present invention and is not intended to limit it. It should be noted that those of ordinary skill in the art may make several improvements and modifications without departing from the technical principles of the invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims

1. A method for automatic recognition of medical text terms based on a long short-term memory network, characterized by comprising:

feeding training data into a bidirectional long short-term memory network to obtain the highest-probability label category for each character in a medical text sentence;

feeding this output, the highest-probability label category of each character, into a conditional random field, and using the Viterbi algorithm to compute the label sequence with the highest joint probability.

2. The method for automatic recognition of medical text terms based on a long short-term memory network according to claim 1, characterized in that character vectors are obtained with the word2vec vector training method, and the resulting character vector matrix L is an n × m matrix, where n is the number of characters in the dictionary and m is the dimension of each character vector, m usually taking a value between 100 and 300.

3. A system for automatic recognition of medical text terms based on a long short-term memory network, characterized by comprising:

a bidirectional long short-term memory network unit, for feeding training data into a bidirectional long short-term memory network to obtain the highest-probability label category for each character in a medical text sentence;

a conditional random field model unit, for feeding this output, the highest-probability label category of each character, into a conditional random field and using the Viterbi algorithm to compute the label sequence with the highest joint probability.

4. The system for automatic recognition of medical text terms based on a long short-term memory network according to claim 1, characterized in that it specifically comprises:

a text input layer, in which text is input split into single characters;

a bidirectional long short-term memory network layer, which extracts features from the character vector embedding layer using a forward LSTM layer and a backward LSTM layer respectively;

a conditional random field layer, which takes the integrated information of the bidirectional LSTM as input and outputs the medical text labeled character by character with its part of speech.