CN105894088B

CN105894088B - Medical information extraction system and method based on deep learning and distributed semantic features

Info

Publication number: CN105894088B
Application number: CN201610176409.8A
Authority: CN
Inventors: 吴永辉; 王璟琪
Original assignee: Suzhou Hebta Health Information Technology Co ltd
Current assignee: Digital China Health Technologies Co ltd; Shenzhou Hebote Medical Information Technology Suzhou Co Ltd
Priority date: 2016-03-25
Filing date: 2016-03-25
Publication date: 2018-06-29
Anticipated expiration: 2036-03-25
Also published as: CN105894088A

Abstract

The invention discloses a medical information extraction system and method based on deep learning and distributed semantic features. The system comprises a preprocessing module, a language-model-based word vector training module, a massive medical knowledge base reinforcement learning module, and a medical named entity recognition module based on a deep artificial neural network. Using a deep learning method with the probability of the generated language model as the optimization objective, primary word vectors are trained on medical text big data. Based on a massive medical knowledge base, a second deep artificial neural network is trained; through deep reinforcement learning, the massive knowledge base is incorporated into the feature learning process of deep learning, yielding distributed semantic features specific to the medical domain. Finally, a deep learning method that optimizes the sentence-level maximum likelihood probability is used for Chinese medical named entity recognition. Word vectors are generated from large amounts of unlabeled data, which avoids the laborious feature selection and tuning process in medical natural language processing.

Description

Medical information extraction system and method based on deep learning and distributed semantic features

Technical field

The present invention relates to a medical information extraction system based on deep learning and distributed semantic features, and to a method of implementing it.

Background technology

The widespread adoption of health information technology has led to unprecedented growth in electronic health record (EHR) data. EHR data are used not only to support clinical operational tasks (for example, clinical decision support systems) but also to support a variety of other clinical tasks. Much important patient information is scattered through narrative medical text, yet most computer applications can only understand structured data. Clinical natural language processing (clinical NLP) techniques, which can extract important patient information from medical text, have therefore been introduced into the medical field and have proved highly useful in many applications.

Since the Sixth Message Understanding Conference (MUC-6), named entity recognition (NER), which aims to identify the boundaries and types of named entities, has become a popular and relatively mature research direction in natural language processing. In medical text processing, named entity recognition (for example, of disease names, drug names, and test names) is likewise one of the most basic processing steps. Many existing NLP systems identify medical concepts with dictionary- and rule-based methods, for example MedLEE. MedLEE is a medical concept extraction system developed at Columbia University in the United States and is one of the earliest and most comprehensive medical NLP systems. MetaMap is an information extraction system for biomedical text developed by the U.S. National Library of Medicine (NLM). cTAKES is an open-source natural language processing toolkit based on the Unstructured Information Management Architecture (UIMA) and OpenNLP. In recent years, medical informatics research organizations have held several international evaluations related to entity recognition. In 2009, i2b2 (the Center of Informatics for Integrating Biology and the Bedside) organized an evaluation focused on medication entity recognition, and in 2010 i2b2 organized another evaluation focused on recognizing symptoms, treatments, and medical tests. International evaluations in 2013, 2014, and 2015, such as ShARe/CLEF and Semantic Evaluation (SemEval), focused on recognizing disease name entities and normalizing them to the UMLS terminology. In the 2009 i2b2 medication entity recognition task, most participating teams used methods based on medical dictionaries and hand-written rules, such as the MedEx system developed at Vanderbilt University in the United States. In the 2010 i2b2 evaluation, the organizers provided a larger annotated corpus, so many participating teams, including the top five systems, used recognition methods based on machine learning. The teams used conditional random fields (CRFs) and structural support vector machines (SSVMs) and explored a large number of feature representation methods.

With the rapid growth of electronic medical record adoption in China, there is now an urgent need to extract important patient information from Chinese clinical text in order to accelerate domestic clinical research. Researchers have begun to study Chinese clinical entity recognition. Wang Shikun et al. of Xiamen University used conditional random fields to recognize entities such as symptoms and pathogenesis in ancient medical case records from the Ming and Qing dynasties. In 2004, Xu Hua et al. proposed an integrated method for Chinese word segmentation and named entity recognition that performs both tasks simultaneously on Chinese medical text and improves the accuracy of each. Lei Jianbo et al. of Peking University compared more comprehensively how several commonly used machine learning algorithms perform at recognizing clinical entities in modern medical text when given different types of features; the algorithms compared include support vector machines, maximum entropy, conditional random fields, and structural support vector machines. In summary, current work on Chinese medical named entity recognition concentrates mainly on studying different machine learning algorithms and combinations of different feature types.

In recent years, natural language processing systems based on deep learning have made significant progress. Such systems use unsupervised learning to learn more effective feature representations from large amounts of unlabeled text. Deep learning is an active research field within machine learning that uses deep neural networks to obtain high-level feature representations. In fields such as image processing, speech recognition, and machine translation, deep learning has achieved better performance than other methods. With deep neural networks, NLP researchers no longer need to spend large amounts of time optimizing features for a specific task; instead, effective features are obtained automatically from large amounts of unlabeled text. Researchers have also found that word vector (word embedding) representations based on deep neural networks capture not only syntactic-level features but also semantic-level features, and that these features can be applied effectively to general English NLP tasks with clear benefit. For example, the deep-neural-network-based NLP system developed by Dr. Ronan Collobert achieved the highest accuracy among existing systems on tasks such as part-of-speech tagging, chunking, named entity recognition, and semantic role labeling.

Word vectors are a currently popular alternative to the traditional bag-of-words feature representation: each word is mapped to an array of floating-point numbers. Compared with the classical approach, the floating-point array representation preserves more semantic information. The conventional approach generates word vectors with a ranking-based method, which treats every sequence that occurs naturally in the corpus as a positive example. For example, with a word window (window size) of 5, the following word sequence is regarded as a positive example:

$$X = \{ w_{L2}, w_{L1}, w_0, w_{R1}, w_{R2} \}$$

Here $w_0$ is the current word, $w_{L2}$ and $w_{L1}$ are the neighboring words to its left, and $w_{R1}$ and $w_{R2}$ are the neighboring words to its right. When the word vector generation algorithm runs, it randomly selects a word $w^*$ to replace $w_0$ and so form a negative example:

$$X^* = \{ w_{L2}, w_{L1}, w^*, w_{R1}, w_{R2} \}$$

The word vector generation algorithm then minimizes the following ranking criterion:

$$\max\{ 0,\ 1 - \mathrm{DNN}(X) + \mathrm{DNN}(X^*) \}$$
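The construction of a positive window, its corrupted negative counterpart, and the ranking criterion can be sketched as follows. This is only an illustration: the function names are hypothetical, the scoring network DNN is not implemented (scores are passed in as plain numbers), and windows are taken only where a full window fits inside the sentence, which is an assumption.

```python
import random

def make_windows(tokens, window_size=5):
    """Return every run of `window_size` consecutive tokens as a positive example."""
    return [tokens[i:i + window_size] for i in range(len(tokens) - window_size + 1)]

def corrupt_center(window, vocab, rng):
    """Build a negative example by replacing the centre word with a different random word."""
    center = len(window) // 2
    candidates = [w for w in vocab if w != window[center]]
    negative = list(window)
    negative[center] = rng.choice(candidates)
    return negative

def ranking_loss(score_pos, score_neg):
    """Ranking criterion max(0, 1 - DNN(X) + DNN(X*)) for one positive/negative pair."""
    return max(0.0, 1.0 - score_pos + score_neg)
```

The loss is zero once the positive window outscores the corrupted window by a margin of at least 1, so only pairs that the network still confuses contribute to training.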

Meanwhile, a conventional deep neural network uses stochastic gradient descent and updates its parameter set with the following formula:

$$\theta = \theta - \lambda \Delta_\theta$$

where $\lambda$ is the learning rate and $\Delta_\theta$ is the gradient.
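As a worked example of this update rule, the sketch below applies it to a toy one-parameter loss, (θ − 3)², whose gradient is 2(θ − 3). The toy loss, the learning rate of 0.1, and the function name are assumptions chosen only to make the arithmetic easy to follow.

```python
def sgd_step(theta, grad, learning_rate):
    """Apply theta <- theta - learning_rate * gradient to each parameter."""
    return [t - learning_rate * g for t, g in zip(theta, grad)]

theta = [0.0]
for _ in range(100):
    grad = [2.0 * (theta[0] - 3.0)]  # gradient of the toy loss (theta - 3)^2
    theta = sgd_step(theta, grad, learning_rate=0.1)
```

Each step shrinks the distance to the minimum at 3 by a factor of 0.8, so after 100 steps the parameter is very close to 3.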

Conventional neural-network-based word vector training usually uses an optimization objective based on a language model. During training, the probability of plausible word sequences under the neural network model is repeatedly maximized, the network parameters are adjusted accordingly, and the vectors being trained are gradually modified through backpropagation, finally yielding word vectors that maximize the probability of plausible text sequences. Although this training method obtains reasonable word vectors by optimizing the language model probability, it ignores existing knowledge bases. Because the general domain is so diverse, no general-purpose knowledge base currently covers the existing knowledge of every field. As a result, domain knowledge cannot be used in word vector training.

Summary of the invention

The object of the present invention is to overcome the shortcomings of the prior art and to provide a medical information extraction system based on deep learning and distributed semantic features, together with a method of implementing it.

The purpose of the present invention is achieved through the following technical solutions：

A medical information extraction system based on deep learning and distributed semantic features, characterized in that it comprises a preprocessing module, a language-model-based word vector training module, a massive medical knowledge base reinforcement learning module, and a medical named entity recognition module based on a deep artificial neural network. The preprocessing module cleans illegal characters from the medical text big data, unifies the Chinese character encoding, and generates the character table used by the next module for word vector training; the character table is the list of characters that occur in all the texts;

The language-model-based word vector training module reads the preprocessed medical text and generates positive examples according to a preset window; at the same time, it generates negative examples by randomly replacing the center word of each positive example, and generates primary word vectors by training a deep neural network with the language model probability as the optimization objective;

The massive medical knowledge base reinforcement learning module takes the primary word vectors as its starting point and uses another deep neural network to perform reinforcement learning on the primary word vectors by optimizing the prediction probability over the medical knowledge base, thereby generating distributed semantic features for the medical domain;

The medical named entity recognition module based on a deep artificial neural network uses the distributed semantic feature representation of the medical domain trained in the massive medical knowledge base reinforcement learning module to train a deep neural network for medical named entity recognition, which identifies the important named entities in medical text.

Further, in the above medical information extraction system based on deep learning and distributed semantic features, the preprocessing module comprises an illegal character filtering module, a Chinese character encoding unification module, and a character table generation module,

The illegal character filtering module traverses the text character by character and removes invalid non-visible characters;

The Chinese character encoding unification module determines the Chinese character encoding of the input text according to the configuration;

The character table generation module generates the character table with the Unicode character as the unit; during subsequent word vector generation, each character in the table is mapped to a word vector of floating-point numbers.

Further, in the above medical information extraction system based on deep learning and distributed semantic features, the language-model-based word vector training module comprises a positive and negative example generation module, a word vector deep neural network module, and a network optimization and training error monitoring module. The positive and negative example generation module reads input sentences and generates positive examples according to a preset window, and at the same time generates the corresponding negative examples by randomly replacing the center word of each positive example;

The word vector deep neural network module feeds the generated positive and negative examples into the network, computes their probabilities, and adjusts the network according to the probabilities of the positive and negative examples;

The network optimization and training error monitoring module optimizes the language model probability globally and monitors the error during training; when the termination condition set for training is reached, it terminates training and saves the model.

Further, in the above medical information extraction system based on deep learning and distributed semantic features, the massive medical knowledge base reinforcement learning module comprises a knowledge base standardization module, a reinforcement learning deep neural network module, and a network optimization and error monitoring module. The knowledge base standardization module standardizes the representation of entities in the knowledge base;

The reinforcement learning deep neural network module takes entities in the knowledge base as input and the primary word vectors as features, makes predictions in the reinforcement learning network, and strengthens the primary word vectors according to the difference between the predicted values and the actual values in the knowledge base;

The network optimization and error monitoring module optimizes the language model probability globally and monitors the error during training; when the termination condition set for training is reached, it terminates training and saves the model.

Further, in the above medical information extraction system based on deep learning and distributed semantic features, the medical named entity recognition module based on a deep artificial neural network comprises a medical named entity deep neural network module and a sentence-level maximum likelihood optimization and overflow control module. The medical named entity deep neural network module reads input sentences, represents them with the distributed semantic features, feeds them into an entity recognition network, and trains a network that recognizes the various medical named entities from a small annotated corpus;

The sentence-level maximum likelihood optimization and overflow control module performs approximate calculation to handle the overflow errors that occur during training of the deep neural network model.

Further, in the above medical information extraction system based on deep learning and distributed semantic features, the sentence-level maximum likelihood optimization and overflow control module uses a maximum likelihood algorithm and prevents model training from failing because of the limited range of computer floating-point representation. The algorithm is:

First, over all inputs $x_i$, find the maximum input $x_{max} = \max_i x_i$;

Then, the inputs are transformed as follows:

[formula not reproduced in the source text]

This avoids floating-point overflow during optimization of the objective function and improves the robustness and precision of the model.
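The transformation formula itself does not appear in this text. One standard transformation that begins with the same first step (finding the maximum input) is the log-sum-exp shift, log Σ exp(x_i) = x_max + log Σ exp(x_i − x_max). The sketch below shows that technique; treating it as the transformation intended here is an assumption, and the function name is hypothetical.

```python
import math

def log_sum_exp(xs):
    """Compute log(sum(exp(x) for x in xs)) after shifting every input by the maximum."""
    x_max = max(xs)
    # After the shift every exponent is <= 0, so math.exp cannot overflow.
    return x_max + math.log(sum(math.exp(x - x_max) for x in xs))
```

With inputs around 1000, a direct call to `math.exp` raises `OverflowError` for double-precision floats, whereas the shifted form only ever exponentiates non-positive numbers.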

Further, in the above medical information extraction system based on deep learning and distributed semantic features, a named entity recognition algorithm based on a deep neural network is used; the deep neural network comprises one convolutional layer, one nonlinear transformation layer based on the HardTanh function, and several linear layers;

When computing the classification score of each word, the context words within a window of a specific size around the target word are taken as input; for words near the start or end of a sentence, a pseudo padding word is used so that the input vectors of all words have a fixed length; each word in the input window is mapped to an N-dimensional vector, where N is the word vector dimension; the convolutional layer then generates global features corresponding to the hidden nodes; finally, the local features and global features are fed together into a standard affine network, which is trained with the backpropagation algorithm; the loss function is defined as the following sentence-level log-likelihood:

[formula not reproduced in the source text]

where $S(X, T)$ is the sentence-level likelihood score when tag sequence $T$ is assigned to input $X$; $H(T_{t-1}, T_t)$ is the global transition score from tag $T_{t-1}$ to tag $T_t$; and $\mathrm{DNN}(X_t, T_t)$ is the deep neural network score when tag $T_t$ is assigned to input $X_t$.
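The loss formula is not shown in this text. A common form consistent with the three quantities defined above takes S(X, T) as the sum over positions of H(T_{t-1}, T_t) + DNN(X_t, T_t), and the log-likelihood as S(X, T) minus the log of the sum of exp S(X, T') over all tag sequences T'. The sketch below computes that quantity with a forward recursion. It is an assumption rather than the patent's stated formula; it omits any score for the initial tag, and all names are hypothetical.

```python
import math

def log_sum_exp(xs):
    """Overflow-safe log(sum(exp(x)))."""
    x_max = max(xs)
    return x_max + math.log(sum(math.exp(x - x_max) for x in xs))

def path_score(emissions, transitions, tags):
    """S(X, T): network scores DNN(X_t, T_t) plus transition scores H(T_{t-1}, T_t)."""
    score = emissions[0][tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return score

def log_partition(emissions, transitions):
    """log of the sum of exp(S(X, T')) over all tag sequences, via the forward recursion."""
    n_tags = len(emissions[0])
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            emissions[t][j]
            + log_sum_exp([alpha[i] + transitions[i][j] for i in range(n_tags)])
            for j in range(n_tags)
        ]
    return log_sum_exp(alpha)

def sentence_log_likelihood(emissions, transitions, tags):
    """Sentence-level log-likelihood of one tag sequence."""
    return path_score(emissions, transitions, tags) - log_partition(emissions, transitions)
```

Here `emissions[t][k]` stands for the network score of tag `k` at position `t`, and `transitions[i][j]` for the score of moving from tag `i` to tag `j`. The forward recursion keeps the cost linear in sentence length instead of enumerating every tag sequence.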

The medical information extraction method of the present invention, based on deep learning and distributed semantic features, comprises the following steps:

Generating negative examples by randomly replacing the center word of input positive examples;

Training primary word vectors with a deep neural network optimized on a language model;

Performing deep reinforcement learning with medical knowledge base big data to obtain a distributed semantic representation for the medical domain;

Performing Chinese medical named entity recognition with a deep neural network that optimizes the sentence-level maximum likelihood probability;

Applying an approximation algorithm that prevents overflow in the deep neural network model;

Incorporating the massive Chinese medical knowledge base into the unsupervised learning process through deep reinforcement learning.

Still further, in the above medical information extraction method based on deep learning and distributed semantic features, the preprocessing module denoises the medical big data, unifies the encoding, and generates the character table; the language-model-based word vector training module reads the medical text and, using a predefined window length, divides each input sentence into positive examples, one per input window; at the same time, the corresponding negative examples are generated by randomly replacing the center word; the positive and negative examples pass repeatedly through a cycle of network probability prediction and network parameter adjustment in a word vector training artificial neural network, and the primary word vectors are finally trained by maximizing the language model; the massive medical knowledge base reinforcement learning module is initialized with the primary word vectors and uses them to predict entries in the massive knowledge base; through continual reinforcement learning the primary word vectors are adjusted, finally yielding a distributed semantic feature representation for the medical domain; the medical named entity recognition module based on a deep artificial neural network reads a small, newly hand-annotated corpus, converts input sentences into a distributed feature description using the distributed semantic features, and predicts the labels of the entries; by continually adjusting the network coefficients, medical named entity recognition based on deep learning and distributed semantic features is achieved.

Still further, in the above medical information extraction method based on deep learning and distributed semantic features, the positive and negative example generation module in the language-model-based word vector training module generates negative examples by randomly replacing the center word of positive examples; the word vector deep neural network module trains the primary word vectors by learning from the positive and negative examples; the network optimization and training error monitoring module optimizes the model, monitors the network training error, and decides whether the training termination condition has been met;

In the massive medical knowledge base reinforcement learning module, the knowledge base standardization module reads medical knowledge base entries and standardizes the knowledge base descriptions; the reinforcement learning deep neural network module reads the standardized entries, compares the network predictions with the true knowledge base labels to generate an error signal, and through reinforcement learning trains the primary word vectors into distributed semantic features for the medical domain;

In the medical named entity recognition module based on a deep artificial neural network, the medical named entity deep neural network module uses a small hand-annotated corpus and, through the sentence-level maximum likelihood optimization and overflow control module, trains a network that accurately recognizes medical named entities, with effective overflow control during model training.

The prominent substantive features and notable progress of the technical solution of the present invention are mainly reflected in the following:

1. Unsupervised feature learning based on neural networks and medical text big data greatly reduces the burden of manual feature selection; unsupervised feature learning does not require large amounts of manual annotation and avoids the time-consuming annotation process;

2. Unsupervised feature learning based on medical text big data improves the coverage of features in the model and clearly improves recall compared with conventional methods;

3. Word vectors are generated from large amounts of unlabeled data, which avoids the laborious feature selection and tuning process in medical natural language processing; the existing massive knowledge bases of the medical domain are fully exploited, and existing knowledge is incorporated into the deep learning algorithm through reinforcement learning, effectively improving system performance;

4. A medical named entity recognition algorithm based on a deep neural network is applied to medical text; evaluated on an annotated corpus of Chinese medical text, it achieves higher performance than conventional methods based on sequence labeling.

Description of the drawings

Fig. 1: Schematic diagram of the architecture of the system of the present invention;

Fig. 2: Structure diagram of the deep neural network.

Detailed description of the embodiments

The present invention uses a deep learning method, with the probability of the generated language model as the optimization objective, to train primary word vectors on medical text big data; based on a massive medical knowledge base, a second deep artificial neural network is trained, and through deep reinforcement learning the massive knowledge base is incorporated into the feature learning process of deep learning, yielding distributed semantic features specific to the medical domain; finally, a deep learning method that optimizes the sentence-level maximum likelihood probability is used for Chinese medical named entity recognition.

As shown in Fig. 1, the medical information extraction system based on deep learning and distributed semantic features comprises a preprocessing module 1, a language-model-based word vector training module 2, a massive medical knowledge base reinforcement learning module 3, and a medical named entity recognition module 4 based on a deep artificial neural network. The preprocessing module 1 cleans illegal characters from the medical text big data, unifies the Chinese character encoding, and generates the character table used by the next module for word vector training; the character table is the list of characters that occur in all the texts;

The language-model-based word vector training module 2 reads the preprocessed medical text and generates positive examples according to a preset window; at the same time, it generates negative examples by randomly replacing the center word of each positive example, and generates primary word vectors by training a deep neural network with the language model probability as the optimization objective;

The massive medical knowledge base reinforcement learning module 3 takes the primary word vectors as its starting point and uses another deep neural network to perform reinforcement learning on the primary word vectors by optimizing the prediction probability over the medical knowledge base, thereby generating distributed semantic features for the medical domain;

The medical named entity recognition module 4 based on a deep artificial neural network uses the distributed semantic feature representation of the medical domain trained in the massive medical knowledge base reinforcement learning module 3 to train a deep neural network for medical named entity recognition, which identifies the important named entities in medical text.

The preprocessing module 1 comprises an illegal character filtering module 101, a Chinese character encoding unification module 102, and a character table generation module 103,

The illegal character filtering module 101 traverses the text character by character and removes invalid non-visible characters, including the control characters 0x0-0x1F in the ASCII table;

The Chinese character encoding unification module 102 determines the Chinese character encoding of the input text according to the configuration; for example, if the input text is GBK-encoded, it is converted to UTF-8; the subsequent system reads UTF-8-encoded text and uses Unicode uniformly in memory;

The character table generation module 103 generates the character table with the Unicode character as the unit; during subsequent word vector generation, each character in the table is mapped to a word vector of floating-point numbers.
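The three preprocessing steps described for modules 101 to 103 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, GBK is assumed as the source encoding, and every character in the range 0x00-0x1F is removed exactly as stated, so the input is assumed to be supplied one sentence at a time.

```python
def to_unicode(raw_bytes, source_encoding="gbk"):
    """Decode input bytes (for example GBK) into a Python str, which is Unicode in memory."""
    return raw_bytes.decode(source_encoding)

def strip_control_chars(text):
    """Remove the ASCII control characters 0x00-0x1F."""
    return "".join(ch for ch in text if ord(ch) > 0x1F)

def build_char_table(texts):
    """Map each distinct Unicode character to an integer index, in first-seen order."""
    table = {}
    for text in texts:
        for ch in text:
            table.setdefault(ch, len(table))
    return table
```

Writing the cleaned text back out with `text.encode("utf-8")` gives the UTF-8 form that later stages read. The integer index from `build_char_table` is what an embedding lookup would use to fetch each character's floating-point vector.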

The language-model-based word vector training module 2 comprises a positive and negative example generation module 201, a word vector deep neural network module 202, and a network optimization and training error monitoring module 203. The positive and negative example generation module 201 reads input sentences and generates positive examples according to a preset window, and at the same time generates the corresponding negative examples by randomly replacing the center word of each positive example;

The word vector deep neural network module 202 feeds the generated positive and negative examples into the network, computes their probabilities, and adjusts the network according to the probabilities of the positive and negative examples;

The network optimization and training error monitoring module 203 optimizes the language model probability globally and monitors the error during training; when the termination condition set for training is reached, it terminates training and saves the model.

The massive medical knowledge base reinforcement learning module 3 comprises a knowledge base standardization module 301, a reinforcement learning deep neural network module 302, and a network optimization and error monitoring module 303. The knowledge base standardization module 301 standardizes the representation of entities in the knowledge base;

The reinforcement learning deep neural network module 302 takes entities in the knowledge base as input and the primary word vectors as features, makes predictions in the reinforcement learning network, and strengthens the primary word vectors according to the difference between the predicted values and the actual values in the knowledge base;

The network optimization and error monitoring module 303 optimizes the language model probability globally and monitors the error during training; when the termination condition set for training is reached, it terminates training and saves the model.

The medical named entity recognition module 4 based on a deep artificial neural network comprises a medical named entity deep neural network module 401 and a sentence-level maximum likelihood optimization and overflow control module 402. The medical named entity deep neural network module 401 reads input sentences, represents them with the distributed semantic features, feeds them into an entity recognition network, and trains a network that recognizes the various medical named entities from a small annotated corpus;

The sentence-level maximum likelihood optimization and overflow control module 402 performs approximate calculation to handle the overflow errors that occur during training of the deep neural network model.

The sentence-level maximum likelihood optimization and overflow control module 402 uses a maximum likelihood algorithm and prevents model training from failing because of the limited range of computer floating-point representation. The algorithm is:

First, over all inputs $x_i$, find the maximum input $x_{max} = \max_i x_i$;

Then, the inputs are transformed as follows:

[formula not reproduced in the source text]

A named entity recognition algorithm based on a deep neural network is used. The deep neural network comprises one convolutional layer, one nonlinear transformation layer based on the HardTanh function, and several linear layers, as shown in Fig. 2; this structure is currently widely used for a variety of NLP tasks.
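Two small building blocks of such a network can be sketched in isolation: the HardTanh nonlinearity, which clips its input to the range [-1, 1], and the fixed-length input window around each target word, padded at the sentence boundaries. The padding token `<PAD>`, the window size, and the function names are assumptions for illustration; the convolutional and linear layers are not shown.

```python
def hard_tanh(x):
    """HardTanh: -1 below -1, +1 above +1, identity in between."""
    return max(-1.0, min(1.0, x))

def padded_windows(tokens, window_size=5, pad="<PAD>"):
    """Return one window per token, padded so that every window has the same length."""
    half = window_size // 2
    padded = [pad] * half + list(tokens) + [pad] * half
    return [padded[i:i + window_size] for i in range(len(tokens))]
```

Because every window has the same number of tokens, concatenating the N-dimensional vectors of its tokens always gives an input of length `window_size * N`, whichever position in the sentence the target word occupies.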

An approximation algorithm effectively prevents overflow in the deep neural network model;

The preprocessing module 1 denoises the medical big data, unifies the encoding, and generates the character table; the language-model-based word vector training module 2 reads the medical text and, using a predefined window length, divides each input sentence into positive examples, one per input window; at the same time, the corresponding negative examples are generated by randomly replacing the center word; the positive and negative examples pass repeatedly through a cycle of network probability prediction and network parameter adjustment in a word vector training artificial neural network, and the primary word vectors are finally trained by maximizing the language model; the massive medical knowledge base reinforcement learning module 3 is initialized with the primary word vectors and uses them to predict entries in the massive knowledge base; through continual reinforcement learning the primary word vectors are adjusted, finally yielding a distributed semantic feature representation for the medical domain; the medical named entity recognition module 4 based on a deep artificial neural network reads a small, newly hand-annotated corpus, converts input sentences into a distributed feature description using the distributed semantic features, and predicts the labels of the entries; by continually adjusting the network coefficients, medical named entity recognition based on deep learning and distributed semantic features is achieved.

Positive and negative example generation module 201 in term vector training module 2 based on language model is used in random replacement positive example The mode of heart word generates negative example；Term vector deep neural network module 202 passes through positive and negative example learning training primary term vector, network Optimization and training error monitoring module 203 carry out model optimization, monitor network training error and training of judgement end condition；

In the massive medical knowledge base reinforcement learning module 3, the knowledge base standardization module 301 reads the medical knowledge base entries and standardizes their descriptions; the reinforcement learning deep neural network module 302 reads the standardized entries, compares the network prediction with the true knowledge base label to generate an error signal, and through reinforcement learning trains the primary word vectors into distributed semantic features for the medical domain.

In the medical named entity recognition module 4 based on a deep artificial neural network, the medical named entity deep neural network module 401 uses a small manually annotated corpus and, together with the sentence-level maximum likelihood optimization and overflow control module 402, trains a network that can accurately identify medical named entities, while applying effective overflow control during model training.
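The positive and negative example generation performed by module 201 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function name `make_examples`, the `<PAD>` token used at sentence boundaries, and the fixed random seed are assumptions made for the example.

```python
import random

def make_examples(sentence, vocab, window=5, pad="<PAD>", rng=None):
    """Build (positive, negative) window pairs for one sentence.

    Each positive example is a window of `window` characters centered on
    one character of the sentence. The matching negative example is the
    same window with its center character replaced by a random,
    different vocabulary entry. Padding at the sentence boundaries is an
    assumption of this sketch.
    """
    rng = rng or random.Random(0)
    half = window // 2
    # Pad both ends so every character gets a full-width window.
    padded = [pad] * half + list(sentence) + [pad] * half
    pairs = []
    for i in range(len(sentence)):
        positive = padded[i:i + window]
        # Candidate replacements must differ from the true center word.
        candidates = [w for w in vocab if w != positive[half]]
        negative = list(positive)
        negative[half] = rng.choice(candidates)
        pairs.append((positive, negative))
    return pairs

# Usage: one pair per character of the input sentence.
pairs = make_examples("患者左肺感染", list("患者左右肺感染肝"), window=5)
```

The vocabulary passed in must contain at least two distinct entries, otherwise no replacement candidate exists for the center word.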

As a highly specialized field, medicine has knowledge bases that are highly standardized and very broad in coverage. An innovative two-step training method is therefore developed. In the first step, intermediate word vectors are obtained by optimizing the language model probability. In the second step, starting from the word vectors of the first step, a neural network is designed that further trains the existing word vectors by optimizing against existing medical knowledge bases. The second step uses a large-scale medical knowledge base as supervision to further optimize the medical semantics represented by the word vectors, which greatly improves the ability of the word vector matrix to express medical semantics and lets the resulting word vectors describe medical knowledge more accurately. This is the key point at which the medical word vector technique differs from other, general-purpose word vector techniques.

Chinese medical knowledge is a valuable resource for guiding the word vectors correctly. Several general-purpose medical knowledge bases in current use are collated, such as the Chinese Pharmacopoeia, which contains information on common drugs, the Chinese diagnostic term set ICD10, and the Chinese edition of the medical term dictionary LOINC. Collating these existing medical terminology resources yields a basic terminology base containing widely used medical terms.

Because Chinese medical research started relatively late, Chinese medical knowledge bases are relatively limited. Thirty medical knowledge bases in wide use abroad are therefore collated, yielding more than 2 million relevant medical term entries, and the English medical terms are translated into Chinese with the help of several domain experts.

One problem with existing medical terminology knowledge bases is insufficient coverage. Research in the medical field indicates that existing medical knowledge bases cover only about 60% of the terms commonly used in the field. Because of time lag, many new terms and much new knowledge cannot be added to the terminology bases promptly. A medical information extraction system is therefore developed to extract a large number of clinically widely used medical terms from large-scale Chinese medical text. With the aid of computer algorithms, the extracted terms are screened, corrected, and merged with the existing knowledge bases. The final result is a comprehensive medical domain knowledge base that is built on existing Chinese medical knowledge bases, supplemented by several commonly used international medical terminology bases, and extended with medical terms that are common in clinical practice but not yet included in any of them.

In the medical-knowledge-guided word vector optimization method, a comprehensive Chinese medical domain knowledge base containing more than 3 million entries is collected and collated. The knowledge base covers the common terms of the medical domain, including drug names, disease names, test results, surgical procedures, treatment means, and adverse reactions. A deep neural network is designed that uses the knowledge base to perform directed optimization of the word vectors trained in the previous stage.

The input to the network is the word vector representation of the medical term being optimized. The input layer reads the word vectors trained in the previous stage by optimizing the language model and uses them as the input vectors for the medical term. For each term, the neural network computes the probability that it belongs to each medical category (the 6 categories listed above); the word vectors are then optimized in a directed way by optimizing the predicted probability of the term's category. The structure of the neural network is as follows:

1) Input layer: converts the input medical term into input vectors using the existing word vectors;

2) Convolutional layer: transforms the input vectors by convolution and maps them to a fixed-length intermediate layer (300 hidden nodes);

3) Linear transformation layer: maps the intermediate layer produced by the convolution to the first hidden layer (500 hidden nodes);

4) Nonlinear transformation layer: applies the HardTanh function to its input and maps it to the second hidden layer (500 hidden nodes);

5) Linear transformation layer: maps the second hidden layer to the final output layer (6 nodes).

From the output-layer probabilities and the true category of the medical term, the corresponding error signal is computed; the back-propagation algorithm then adjusts the parameters of the entire neural network and, ultimately, the corresponding word vectors.
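The forward pass through the five layers described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the class name `TermClassifier`, the convolution window of 3 characters, max-pooling over positions to obtain the fixed-length 300-node layer, the softmax output, and the cross-entropy error signal are all choices made for the sketch; the text specifies only the layer order and sizes.

```python
import math
import random

def hardtanh(v):
    # HardTanh clips each value to the interval [-1, 1].
    return [max(-1.0, min(1.0, x)) for x in v]

def linear(weights, bias, v):
    # Affine map: one output per weight row.
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]

def softmax(scores):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def init(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class TermClassifier:
    """Sketch: conv (300) -> linear (500) -> HardTanh (500) -> linear (6)."""

    def __init__(self, dim=50, window=3, conv=300, hidden=500, classes=6, seed=0):
        rng = random.Random(seed)
        self.dim, self.window = dim, window
        self.w_conv, self.b_conv = init(conv, dim * window, rng), [0.0] * conv
        self.w_hid, self.b_hid = init(hidden, conv, rng), [0.0] * hidden
        self.w_out, self.b_out = init(classes, hidden, rng), [0.0] * classes

    def forward(self, vectors):
        """`vectors` is a list with one word vector per character of the term."""
        half = self.window // 2
        zero = [0.0] * self.dim
        padded = [zero] * half + vectors + [zero] * half
        # Convolution: the same affine map applied at every position.
        frames = []
        for i in range(len(vectors)):
            flat = [x for vec in padded[i:i + self.window] for x in vec]
            frames.append(linear(self.w_conv, self.b_conv, flat))
        # Assumed max-pooling over positions gives a fixed-length layer.
        pooled = [max(column) for column in zip(*frames)]
        hidden1 = linear(self.w_hid, self.b_hid, pooled)
        hidden2 = hardtanh(hidden1)
        return softmax(linear(self.w_out, self.b_out, hidden2))

def error_signal(probs, true_class):
    # Assumed cross-entropy error against the true knowledge base category.
    return -math.log(probs[true_class])
```

In a full training loop, the gradient of `error_signal` would be propagated back through every layer and into the input word vectors themselves, which is what allows the knowledge base to adjust them.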

Training method: during model training, one fifth of the unannotated training corpus is held out as a validation set. For parameter selection, the learning rate is set to 0.01, the word vector dimension to 50, the number of hidden-layer nodes to 100 (hidden node counts from 50 to 150 were tested; 100 gave the best results, and more than 100 brought no significant improvement), and the word window to 5. All deep neural network parameters are updated with stochastic gradient descent and back propagation. For Chinese medical text, no word segmentation is used; instead, each individual Chinese character is treated as an independent word for which a word vector is generated.
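The training setup described above might be captured as in the following sketch. The names `CONFIG`, `to_characters`, and `split_validation` are hypothetical, and both the random shuffling before the split and the skipping of whitespace are assumptions; the text states only the one-fifth proportion, the parameter values, and the character-level treatment.

```python
import random

# Hyperparameters as stated in the description.
CONFIG = {
    "learning_rate": 0.01,
    "vector_dim": 50,
    "hidden_nodes": 100,
    "window": 5,
}

def to_characters(text):
    """Treat each character as one word (no word segmentation).

    Skipping whitespace is an assumption of this sketch.
    """
    return [ch for ch in text if not ch.isspace()]

def split_validation(sentences, fraction=0.2, seed=0):
    """Hold out `fraction` of the sentences as a validation set."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * fraction)
    return shuffled[cut:], shuffled[:cut]

train, validation = split_validation(["左肺感染", "右肺结节", "肝功能异常", "头痛三天", "血压升高"])
```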

The word vectors capture not only syntactic information but also semantic information. Once the word vectors have been obtained, the words most similar to each word are computed using cosine similarity. In the following example, the first column lists the words most similar to "one"; it consists mainly of numerals and measure words. The third column contains mainly terms related to human organs.

One	Left	Limb	Larynx
Three	Right	Jaw	Top
Two	Double	Lung	Office
Half	Two	Arm	Nose
0	On	Wall	Sinus
Two	And	States	Chamber
Number	Have	Noon	Eyelid
Have	Before	Obvious	Gorge
Compared with	Pillow	Neck	Foot
Beautiful	Under	Stern	Tears
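The nearest-neighbor lookup behind a table like the one above can be sketched as follows. The function names and the toy two-dimensional vectors are made up for illustration; the trained vectors in the description have 50 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def most_similar(word, vectors, top_n=10):
    """Return the `top_n` words whose vectors are closest to `word`."""
    query = vectors[word]
    scored = [(cosine_similarity(query, vec), other)
              for other, vec in vectors.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:top_n]]

# Toy vectors, invented for this example only.
toy = {"一": [1.0, 0.1], "二": [0.9, 0.2], "肺": [0.1, 1.0]}
neighbors = most_similar("一", toy, top_n=2)
```

With these toy vectors, the numeral "二" ranks ahead of "肺" as a neighbor of "一", which mirrors the grouping of numerals seen in the first column.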

In conclusion the present invention proposes and a kind of identifies medical treatment based on the method for deep learning and distributed semantic feature 6 kinds of important informations in text, including：The information such as drug, detection, disease, surgical procedure, treatment means and adverse reaction. Compared with conventional conditions random field (CRF) model, the method have the characteristics that:1) using a large amount of un-annotated datas come generate word to Amount, so as to avoid cumbersome feature selecting and the evolutionary process in medicine natural language processing；2) medical domain is made full use of to show Some mass knowledge libraries are attached to existing knowledge in deep learning algorithm by strengthening study, so as to effectively improve systematicness Energy；3) for medicine text using the medicine name entity identification algorithms based on deep-neural-network, in Chinese medical text marking It is assessed in corpus, achieves performance more higher than traditional method based on sequence labelling.

It should be understood that the above is only a preferred embodiment of the present invention. A person of ordinary skill in the art may make several improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims

1. A medical information extraction system based on deep learning and distributed semantic features, characterized in that it comprises a preprocessing module (1), a language-model-based word vector training module (2), a massive medical knowledge base reinforcement learning module (3), and a medical named entity recognition module (4) based on a deep artificial neural network; the preprocessing module (1) is used to clean invalid characters from the medical text big data, unify the Chinese character encoding, and generate the vocabulary table used by the word vector training in the next module, the vocabulary table being a list of all words that occur in the text;

the language-model-based word vector training module (2) reads the preprocessed medical text and generates positive examples according to a preset window; at the same time, it generates negative examples by randomly replacing the center word of each positive example, and trains a deep neural network, with the language model probability as the optimization objective, to generate the primary word vectors;

the massive medical knowledge base reinforcement learning module (3) takes the primary word vectors as its starting point and uses another deep neural network to perform reinforcement learning on them by optimizing the prediction probability over the medical knowledge base, thereby generating the distributed semantic features of the medical domain;

the medical named entity recognition module (4) based on a deep artificial neural network uses the distributed semantic feature representation of the medical domain trained in the massive medical knowledge base reinforcement learning module (3) to train a deep neural network for medical named entity recognition, which identifies the important named entities in medical text.

2. The medical information extraction system based on deep learning and distributed semantic features according to claim 1, characterized in that the preprocessing module (1) comprises an invalid character filtering module (101), a Chinese character encoding unification module (102), and a vocabulary table generation module (103);

the invalid character filtering module (101) traverses the text character by character and removes invalid non-visible characters;

the Chinese character encoding unification module (102) sets the Chinese character encoding of the input text according to the configured setting;

the vocabulary table generation module (103) generates the vocabulary table with the Unicode character as the unit; during subsequent word vector generation, each word in the table is mapped to a word vector in floating-point form.

3. The medical information extraction system based on deep learning and distributed semantic features according to claim 1, characterized in that the language-model-based word vector training module (2) comprises a positive and negative example generation module (201), a word vector deep neural network module (202), and a network optimization and training error monitoring module (203); the positive and negative example generation module (201) is used to read the input sentence and generate positive examples according to a preset window, and at the same time to generate the corresponding negative examples by randomly replacing the center word of each positive example;

the word vector deep neural network module (202) feeds the generated positive and negative examples into the network, computes their probabilities, and adjusts the network according to the probabilities of the positive and negative examples;

the network optimization and training error monitoring module (203) globally optimizes the language model probability and controls the error during training; when the termination condition set for training is reached, it terminates training and saves the model.

4. The medical information extraction system based on deep learning and distributed semantic features according to claim 1, characterized in that the massive medical knowledge base reinforcement learning module (3) comprises a knowledge base standardization module (301), a reinforcement learning deep neural network module (302), and a network optimization and error monitoring module (303); the knowledge base standardization module (301) standardizes the representation of the entities in the knowledge base;

the reinforcement learning deep neural network module (302) takes the entities in the knowledge base as input and the primary word vectors as features, makes predictions in the reinforcement learning network, and reinforces the primary word vectors according to the predicted values and the true values in the knowledge base;

the network optimization and error monitoring module (303) globally optimizes the language model probability and controls the error during training; when the termination condition set for training is reached, it terminates training and saves the model.

5. The medical information extraction system based on deep learning and distributed semantic features according to claim 1, characterized in that the medical named entity recognition module (4) based on a deep artificial neural network comprises a medical named entity deep neural network module (401) and a sentence-level maximum likelihood optimization and overflow control module (402); the medical named entity deep neural network module (401) reads the input sentence, represents it using the distributed semantic features, feeds it into an entity recognition network, and trains a recognition network that identifies the various medical named entities from a small annotated corpus;

the sentence-level maximum likelihood optimization and overflow control module (402) performs an approximate calculation to handle overflow errors that occur during training of the deep neural network model.

6. The medical information extraction system based on deep learning and distributed semantic features according to claim 5, characterized in that the sentence-level maximum likelihood optimization and overflow control module (402) prevents model training with the maximum likelihood algorithm from failing because of the limited range of computer floating-point representation, the algorithm being:

First, over all inputs x_i, find the maximum input x_max = MAX(x_i);

Then, the conversion is performed as follows:
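The conversion formula itself is not reproduced in this text. One common transformation that matches the description (find the maximum input first, then convert) is the log-sum-exp shift; the sketch below assumes that is what is meant, and the function name `log_sum_exp` is introduced only for illustration.

```python
import math

def log_sum_exp(xs):
    """Compute log(sum(exp(x) for x in xs)) without overflow.

    Uses the identity
        log sum_i exp(x_i) = x_max + log sum_i exp(x_i - x_max),
    so every exponent is at most 0 and exp() cannot overflow.
    """
    x_max = max(xs)
    return x_max + math.log(sum(math.exp(x - x_max) for x in xs))

# A naive math.exp(1000.0) raises OverflowError; the shifted form does not.
value = log_sum_exp([1000.0, 1000.0])
```

The shift changes nothing mathematically; it only keeps the intermediate values inside the range a floating-point number can represent.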

7. The medical information extraction system based on deep learning and distributed semantic features according to claim 1, characterized in that a named entity recognition algorithm based on a deep neural network is used, the deep neural network comprising a convolutional layer, a nonlinear transformation layer based on the HardTanh function, and multiple linear layers;

when the category scores of each word are computed, the context words within a window of a specific size around the target word are taken as input; for words near the beginning or end of a sentence, a pseudo padding word is used to ensure that the input vectors of all words have a fixed length; each word in the input window is mapped to an N-dimensional vector, where N is the word vector dimension; the convolutional layer then generates global features corresponding to the hidden nodes; finally, the local features and global features are fed together into a standard affine network, which is trained using the back-propagation algorithm; the loss function is defined as the following sentence-level log-likelihood:

where S(X, T) is the sentence-level likelihood score when the label sequence T is assigned to the input X; H(T_t-1, T_t) is the global transition score from label T_t-1 to label T_t; and DNN(X_t, T_t) is the deep neural network score when label T_t is assigned to input X_t.
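The loss formula itself is not reproduced in this text. A form consistent with the three definitions given, assuming the standard sentence-level log-likelihood used for neural sequence labeling, would be the following; the sentence length n, the start label T_0, and the sum over all candidate label sequences T' are assumptions of this reconstruction.

```latex
S(X, T) = \sum_{t=1}^{n} \Big( H(T_{t-1}, T_t) + \mathrm{DNN}(X_t, T_t) \Big)

\log P(T \mid X) = S(X, T) - \log \sum_{T'} \exp\big( S(X, T') \big)
```

Under this reading, training maximizes log P(T | X) for the annotated label sequence, and the sum over T' is the term in which the exponentials can overflow.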

8. A medical information extraction method based on deep learning and distributed semantic features for implementing the system according to claim 1, characterized in that it comprises the following steps:

performing deep reinforcement learning using medical knowledge base big data to obtain a distributed semantic representation for the medical domain;

applying an approximation algorithm that prevents overflow in the deep neural network model.

9. The medical information extraction method based on deep learning and distributed semantic features according to claim 8, characterized in that: the preprocessing module (1) denoises the medical big data, unifies its character encoding, and generates the vocabulary table; the language-model-based word vector training module (2) reads the medical text and, using a predefined window length, splits each input sentence into multiple input windows that serve as positive examples, and at the same time generates a corresponding negative example for each window by randomly replacing the center word; the positive and negative examples are cycled repeatedly through probability prediction and network parameter adjustment in a word vector training artificial neural network, and the primary word vectors are finally obtained by maximizing the language model probability; the massive medical knowledge base reinforcement learning module (3) is initialized with the primary word vectors, uses them to predict the entries of the massive knowledge base, adjusts the primary word vectors through continued reinforcement learning, and finally obtains a distributed semantic feature representation for the medical domain; the medical named entity recognition module (4) based on a deep artificial neural network reads a small newly hand-annotated corpus, converts each input sentence into a distributed feature description using the distributed semantic features, and predicts the label of each entry, and by continually adjusting the network coefficients achieves medical named entity recognition based on deep learning and distributed semantic features.

10. The medical information extraction method based on deep learning and distributed semantic features according to claim 9, characterized in that: in the language-model-based word vector training module (2), the positive and negative example generation module (201) generates negative examples by randomly replacing the center word of each positive example; the word vector deep neural network module (202) learns the primary word vectors from the positive and negative examples; and the network optimization and training error monitoring module (203) optimizes the model, monitors the network training error, and decides when the training termination condition has been met;

in the massive medical knowledge base reinforcement learning module (3), the knowledge base standardization module (301) reads the medical knowledge base entries and standardizes their descriptions; the reinforcement learning deep neural network module (302) reads the standardized entries, compares the network prediction with the true knowledge base label to generate an error signal, and through reinforcement learning trains the primary word vectors into distributed semantic features for the medical domain;

in the medical named entity recognition module (4) based on a deep artificial neural network, the medical named entity deep neural network module (401) uses a small manually annotated corpus and, together with the sentence-level maximum likelihood optimization and overflow control module (402), trains a network that accurately identifies medical named entities, while applying overflow control during model training.