Medical information extraction system and method based on deep learning and distributed semantic features
Technical field
The present invention relates to a medical information extraction system based on deep learning and distributed semantic features, and to a method for implementing it.
Background
The widespread use of health information technology has led to an unprecedented expansion of electronic health record (EHR) data. EHR data have been used not only to support clinical operational tasks (for example, clinical decision support systems) but also to support a variety of other clinical tasks. Much important patient information is scattered across narrative medical text, yet most computer applications can only understand structured data. Clinical natural language processing (clinical NLP) technology, which can extract important patient information from medical text, has therefore been introduced into the medical field and has proved highly useful in many applications.
Since the Sixth Message Understanding Conference (MUC-6), named entity recognition (NER), which aims to identify the boundaries and types of named entities, has become a popular and relatively mature research direction in natural language processing. In medical text processing, named entity recognition (for example, of disease names, drug names, and test names) is likewise one of the most basic processing steps. Many existing NLP systems, such as MedLEE, identify medical concepts with dictionary-based and rule-based methods. MedLEE is a medical concept extraction system developed at Columbia University in the United States and one of the earliest and most comprehensive medical NLP systems. MetaMap is an information extraction system for biomedical text developed by the National Library of Medicine (NLM). cTAKES is an open-source natural language processing toolkit built on the Unstructured Information Management Architecture (UIMA) and OpenNLP. In recent years, medical informatics research organizations have organized several international evaluations related to entity recognition. In 2009, i2b2 (the Center of Informatics for Integrating Biology and the Bedside) organized an evaluation focused on recognizing medication-related entities, and in 2010 i2b2 organized another evaluation focused on recognizing symptom, treatment, and medical test entities. International evaluations in 2013, 2014, and 2015, such as ShARe/CLEF and Semantic Evaluation (SemEval), focused on recognizing disease name entities and normalizing them to the UMLS terminology. In the 2009 i2b2 medication entity recognition task, most participating teams used methods based on medical dictionaries and hand-written rules, such as the MedEx system developed at Vanderbilt University in the United States. In the 2010 i2b2 evaluation, the organizers provided a larger annotated corpus, so many participating teams, including all of the top five systems, used recognition methods based on machine learning. Participating teams used conditional random fields (CRFs) and structural support vector machines (SSVMs), and explored a large number of feature representation methods.
With the rapid growth of electronic medical record adoption in China, there is an urgent need to extract important patient information from Chinese clinical text in order to accelerate domestic clinical research. Scholars have begun to study Chinese clinical entity recognition. Wang Shikun et al. of Xiamen University used conditional random fields to identify entities such as symptoms and pathogenesis in ancient medical case records of the Ming and Qing dynasties. In 2004, Xu Hua et al. proposed an integrated approach to Chinese word segmentation and named entity recognition that performs the two tasks jointly on Chinese medical text and improves the accuracy of each. Lei Jianbo et al. of Peking University compared more comprehensively how several commonly used machine learning algorithms perform when recognizing clinical entities in modern medical text with different types of features; the algorithms compared include support vector machines, maximum entropy, conditional random fields, and structural support vector machines. In summary, current work on Chinese medical named entity recognition concentrates mainly on studying combinations of different machine learning algorithms and different types of features.
In recent years, natural language processing systems based on deep learning have made significant progress. Such systems use unsupervised learning to learn more effective feature representations from large amounts of unannotated text. Deep learning is an active research area within machine learning that uses deep neural networks to obtain high-level feature representations. In fields such as image processing, speech recognition, and machine translation, deep learning has outperformed other methods. With deep neural networks, NLP researchers no longer need to spend large amounts of time optimizing features for a specific task; instead, effective features are obtained automatically from large amounts of unannotated text. Researchers have also found that word vector (word embedding) representations based on deep neural networks capture not only syntactic-level features but also semantic-level features, and that these features can be applied effectively to general English NLP tasks with clear gains. For example, the deep neural network-based NLP system developed by Dr. Ronan Collobert achieved the highest accuracy among existing systems on tasks such as part-of-speech tagging, phrase chunking, named entity recognition, and semantic role labeling.
Word vectors are a currently popular alternative to the traditional bag-of-words feature representation: each word is mapped to an array of floating-point numbers. Compared with the classical approach, the floating-point array representation preserves more semantic information. Conventional methods use a sequence-based word vector generation method, which treats every sequence that occurs naturally in the corpus as a positive example. For example, with a word window (window size) of 5, the following word sequence is treated as a positive example:
\[ X = \{ w_{L2}, w_{L1}, w_{0}, w_{R1}, w_{R2} \} \]
where w_0 is the current word, w_{L2} and w_{L1} are the neighboring words to its left, and w_{R1} and w_{R2} are the neighboring words to its right. When the word vector generation algorithm runs, it randomly selects a word w^* to replace w_0 and so forms a negative example:
\[ X^{*} = \{ w_{L2}, w_{L1}, w^{*}, w_{R1}, w_{R2} \} \]
The word vector generation algorithm then minimizes the following ranking criterion:
\[ \max\{ 0,\; 1 - \mathrm{DNN}(X) + \mathrm{DNN}(X^{*}) \} \]
Meanwhile, traditional deep neural networks use the stochastic gradient descent algorithm and update the parameter set with the following formula:
\[ \theta = \theta - \lambda \Delta_{\theta} \]
where \lambda is the learning rate and \Delta_{\theta} is the gradient.
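The ranking criterion and the update rule above can be illustrated with a minimal sketch. It is not the network described in this document: the scorer DNN(X) is replaced by a single linear layer over the concatenated window vectors, and the function names, vocabulary size, vector dimension, and learning rate are all assumptions chosen for illustration.

```python
import numpy as np


def score(E, w, ids):
    """Stand-in for DNN(X): concatenate the window's vectors, apply a linear scorer."""
    return float(E[ids].reshape(-1) @ w)


def ranking_loss(E, w, pos, neg):
    """Ranking criterion max{0, 1 - DNN(X) + DNN(X*)}."""
    return max(0.0, 1.0 - score(E, w, pos) + score(E, w, neg))


def sgd_step(E, w, pos, neg, lr=0.05):
    """One update theta = theta - lr * gradient on a (positive, negative) pair.

    E (vocab x dim word vectors) and w (scorer weights) are updated in place.
    Returns the loss measured before the update.
    """
    loss = ranking_loss(E, w, pos, neg)
    if loss > 0.0:
        # Gradient of the active hinge with respect to the scorer weights.
        grad_w = E[neg].reshape(-1) - E[pos].reshape(-1)
        # Gradient with respect to the word vectors: -w_k at positive
        # positions, +w_k at negative positions (shared context words cancel).
        w_blocks = w.reshape(len(pos), -1)
        grad_E = np.zeros_like(E)
        np.add.at(grad_E, pos, -w_blocks)
        np.add.at(grad_E, neg, w_blocks)
        w -= lr * grad_w
        E -= lr * grad_E
    return loss
```

Calling `sgd_step` repeatedly on a window and its corrupted copy pushes the score of the positive window above that of the negative one until the margin of 1 is satisfied, at which point the loss is zero and no further update is made.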
Traditional neural network-based word vector training usually uses an optimization objective based on a language model. During training, the probability of plausible word sequences under the neural network model is repeatedly maximized, the network parameters are adjusted accordingly, and the vectors being trained are gradually modified through back-propagation, finally yielding word vectors that maximize the plausibility of text sequences. Although this training method can obtain reasonable word vectors by optimizing the language model probability, it ignores existing knowledge bases. Because the general domain is so diverse, there is currently no general knowledge base that covers the existing knowledge of every field. Domain knowledge therefore cannot be used in the word vector training process.
Summary of the invention
The purpose of the present invention is to overcome the shortcomings of the prior art by providing a medical information extraction system based on deep learning and distributed semantic features, and a method for implementing it.
The purpose of the present invention is achieved through the following technical solutions:
A medical information extraction system based on deep learning and distributed semantic features, characterized in that it comprises a preprocessing module, a word vector training module based on a language model, a massive medical knowledge base reinforcement learning module, and a medical named entity recognition module based on a deep artificial neural network. The preprocessing module cleans illegal characters from medical text big data, unifies the Chinese character encoding, and generates the character table used by the next module for word vector training, the character table being the list of characters that occur in all texts;
the word vector training module based on a language model reads the preprocessed medical text and generates positive examples according to a preset window; it also generates negative examples by randomly replacing the center word of a positive example, and generates primary word vectors by training a deep neural network with the language model probability as the optimization objective;
the massive medical knowledge base reinforcement learning module takes the primary word vectors as its starting point and uses another deep neural network to apply reinforcement learning to the primary word vectors by optimizing the prediction probability over the medical knowledge base, thereby generating distributed semantic features for the medical domain;
the medical named entity recognition module based on a deep artificial neural network uses the distributed semantic feature representation of the medical domain trained in the massive medical knowledge base reinforcement learning module to train a deep neural network for medical named entity recognition, which identifies the important named entities in medical text.
Further, in the above medical information extraction system based on deep learning and distributed semantic features, the preprocessing module comprises an illegal character filtering module, a Chinese character encoding unification module, and a character table generation module;
the illegal character filtering module traverses the text character by character and removes invalid non-visible characters;
the Chinese character encoding unification module determines the Chinese character encoding of the input text according to the configuration;
the character table generation module generates the character table with the Unicode character as the unit; during subsequent word vector generation, each character in the table is mapped to a word vector of floating-point numbers.
Further, in the above medical information extraction system based on deep learning and distributed semantic features, the word vector training module based on a language model comprises a positive and negative example generation module, a word vector deep neural network module, and a network optimization and training error monitoring module. The positive and negative example generation module reads the input sentences and generates positive examples according to the preset window, and generates the corresponding negative examples by randomly replacing the center word of each positive example;
the word vector deep neural network module feeds the generated positive and negative examples into the network, computes their probabilities, and adjusts the network according to the probabilities of the positive and negative examples;
the network optimization and training error monitoring module optimizes the language model probability globally and monitors the error during training; when the configured termination condition is reached, it stops training and saves the model.
Further, in the above medical information extraction system based on deep learning and distributed semantic features, the massive medical knowledge base reinforcement learning module comprises a knowledge base standardization module, a reinforcement learning deep neural network module, and a network optimization and error monitoring module. The knowledge base standardization module standardizes the representation of the entities in the knowledge base;
the reinforcement learning deep neural network module takes the entities in the knowledge base as input and the primary word vectors as features, makes predictions in the reinforcement learning network, and strengthens the primary word vectors according to how the predicted values compare with the true values in the knowledge base;
the network optimization and error monitoring module optimizes the language model probability globally and monitors the error during training; when the configured termination condition is reached, it stops training and saves the model.
Further, in the above medical information extraction system based on deep learning and distributed semantic features, the medical named entity recognition module based on a deep artificial neural network comprises a medical named entity deep neural network module and a sentence-level maximum likelihood optimization and overflow control module. The medical named entity deep neural network module reads the input sentence, represents it with the distributed semantic features, feeds it into an entity recognition network, and trains a network that recognizes the various medical named entities from a small annotated corpus;
the sentence-level maximum likelihood optimization and overflow control module performs an approximate calculation to handle the overflow errors that occur during training of the deep neural network model.
Further, in the above medical information extraction system based on deep learning and distributed semantic features, the sentence-level maximum likelihood optimization and overflow control module uses a maximum likelihood algorithm and prevents model training from failing because of the limited range of computer floating-point numbers. The algorithm is:
First, find the maximum of all inputs x_i: x_max = MAX(x_i);
Then, apply a transformation (the formula itself is missing from the source text),
which avoids floating-point overflow during optimization of the objective function and improves the robustness and precision of the model.
Further, the above medical information extraction system based on deep learning and distributed semantic features uses a named entity recognition algorithm based on a deep neural network; the deep neural network comprises a convolutional layer, a nonlinear transformation layer based on the HardTanh function, and several linear layers;
when the classification score of each word is computed, the context words within a window of a specific size around the target word are taken as input; for words near the beginning or end of a sentence, a pseudo padding word is used so that the input vectors of all words have a fixed length; each word in the input window is mapped to an N-dimensional vector, where N is the word vector dimension; the convolutional layer then generates the global features corresponding to the hidden nodes; finally, the local and global features are fed together into a standard affine network, which is trained with the back-propagation algorithm. The loss function is defined as a sentence-level log-likelihood (the formula itself is missing from the source text),
in which S(X, T) is the sentence-level likelihood score when the tag sequence T is assigned to the input X; H(T_{t-1}, T_t) is the global transition score from tag T_{t-1} to tag T_t; and DNN(X_t, T_t) is the deep neural network score when tag T_t is assigned to input X_t.
The medical information extraction method of the present invention, based on deep learning and distributed semantic features, includes the following steps:
generating negative examples by randomly replacing the center word of an input positive example;
training primary word vectors with a deep neural network optimized for the language model;
performing deep reinforcement learning with medical knowledge base big data to obtain a distributed semantic representation for the medical domain;
performing Chinese medical named entity recognition with a deep neural network that optimizes the sentence-level maximum likelihood estimate;
applying an approximation algorithm that prevents overflow in the deep neural network model;
integrating a massive Chinese medical knowledge base into the unsupervised learning process through deep reinforcement learning.
Still further, in the above medical information extraction method based on deep learning and distributed semantic features, the preprocessing module denoises the medical big data, unifies its encoding, and generates the character table. The word vector training module based on a language model reads the medical text and, using a predefined window length, divides each input sentence into positive examples of several input windows; it generates the corresponding negative examples by randomly replacing the center word. The positive and negative examples pass through repeated cycles of network probability prediction and network parameter adjustment in a word vector training artificial neural network, and the primary word vectors are finally trained by maximizing the language model. The massive medical knowledge base reinforcement learning module is initialized with the primary word vectors and uses them to predict the entries in the massive knowledge base; through continued reinforcement learning it adjusts the primary word vectors and finally obtains a distributed semantic feature representation for the medical domain. The medical named entity recognition module based on a deep artificial neural network reads a small, newly manually annotated corpus, converts the input sentences into a distributed feature description using the distributed semantic features, and predicts the annotation of each entry; by continually adjusting the network coefficients, it realizes medical named entity recognition based on deep learning and distributed semantic features.
Still further, in the above medical information extraction method based on deep learning and distributed semantic features, the positive and negative example generation module in the word vector training module based on a language model generates negative examples by randomly replacing the center word of a positive example; the word vector deep neural network module trains the primary word vectors by learning from the positive and negative examples; and the network optimization and training error monitoring module optimizes the model, monitors the network training error, and checks the training termination condition;
in the massive medical knowledge base reinforcement learning module, the knowledge base standardization module reads the medical knowledge base entries and standardizes the knowledge base descriptions; the reinforcement learning deep neural network module reads the standardized entries, generates an error signal by comparing the network prediction with the true knowledge base annotation, and through reinforcement learning trains the primary word vectors into distributed semantic features for the medical domain;
in the medical named entity recognition module based on a deep artificial neural network, the medical named entity deep neural network module uses a small manually annotated corpus and, through the sentence-level maximum likelihood optimization and overflow control module, trains a network that can accurately recognize medical named entities while effectively controlling overflow during model training.
The prominent substantive features and significant progress of the technical solution of the present invention are mainly reflected in the following:
1. Unsupervised feature learning based on neural networks and medical text big data greatly reduces the burden of manual feature selection; unsupervised feature learning does not require large amounts of manual annotation and so avoids a time-consuming annotation process;
2. Unsupervised feature learning on medical text big data improves the coverage of the features in the model and markedly improves recall compared with traditional methods;
3. Word vectors are generated from large amounts of unannotated data, which avoids the tedious feature selection and tuning process in medical natural language processing; the existing massive knowledge bases of the medical domain are fully used, and existing knowledge is incorporated into the deep learning algorithm through reinforcement learning, effectively improving system performance;
4. For medical text, a medical named entity recognition algorithm based on a deep neural network is used; evaluated on an annotated corpus of Chinese medical text, it achieves higher performance than traditional methods based on sequence labeling.
Description of the drawings
Fig. 1: schematic diagram of the architecture of the system of the present invention;
Fig. 2: structural diagram of the deep neural network.
Detailed description of the embodiments
Using a deep learning method, the present invention trains primary word vectors on medical text big data with the language model probability as the optimization objective. Based on a massive medical knowledge base, it then trains a second deep artificial neural network and, through deep reinforcement learning, integrates the knowledge base into the feature learning process of deep learning, thereby obtaining distributed semantic features specific to the medical domain. Finally, it performs Chinese medical named entity recognition with a deep learning method that optimizes the sentence-level maximum likelihood probability.
As shown in Fig. 1, the medical information extraction system based on deep learning and distributed semantic features comprises a preprocessing module 1, a word vector training module 2 based on a language model, a massive medical knowledge base reinforcement learning module 3, and a medical named entity recognition module 4 based on a deep artificial neural network. The preprocessing module 1 cleans illegal characters from medical text big data, unifies the Chinese character encoding, and generates the character table used by the next module for word vector training, the character table being the list of characters that occur in all texts;
the word vector training module 2 based on a language model reads the preprocessed medical text and generates positive examples according to a preset window; it also generates negative examples by randomly replacing the center word of a positive example, and generates primary word vectors by training a deep neural network with the language model probability as the optimization objective;
the massive medical knowledge base reinforcement learning module 3 takes the primary word vectors as its starting point and uses another deep neural network to apply reinforcement learning to the primary word vectors by optimizing the prediction probability over the medical knowledge base, thereby generating distributed semantic features for the medical domain;
the medical named entity recognition module 4 based on a deep artificial neural network uses the distributed semantic feature representation of the medical domain trained in the massive medical knowledge base reinforcement learning module 3 to train a deep neural network for medical named entity recognition, which identifies the important named entities in medical text.
The preprocessing module 1 comprises an illegal character filtering module 101, a Chinese character encoding unification module 102, and a character table generation module 103;
the illegal character filtering module 101 traverses the text character by character and removes invalid non-visible characters, including the control characters 0x0-0x1F of the ASCII code table;
the Chinese character encoding unification module 102 determines the Chinese character encoding of the input text according to the configuration; for example, if the input text is GBK-encoded, it is converted to UTF-8, the subsequent components read UTF-8-encoded text, and Unicode is used uniformly in memory;
the character table generation module 103 generates the character table with the Unicode character as the unit; during subsequent word vector generation, each character in the table is mapped to a word vector of floating-point numbers.
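The three preprocessing steps can be sketched as follows. This is a minimal illustration, not the system's actual code: the function names are hypothetical, and the filter removes every character in the range 0x00-0x1F, which includes tabs and newlines, so the sketch assumes it is applied to one line of text at a time.

```python
def to_unicode(raw: bytes, encoding: str = "gbk") -> str:
    """Decode input bytes (GBK is assumed as the configured input encoding)
    into a Unicode string for in-memory processing."""
    return raw.decode(encoding)


def strip_control_chars(text: str) -> str:
    """Remove the ASCII control characters 0x00-0x1F (the role of module 101)."""
    return "".join(ch for ch in text if ord(ch) > 0x1F)


def build_char_table(lines):
    """Assign each distinct Unicode character an index in order of first
    occurrence (the role of module 103); the index later selects a word vector."""
    table = {}
    for line in lines:
        for ch in line:
            table.setdefault(ch, len(table))
    return table
```

To store the cleaned text in the unified encoding, the string can be written back with `text.encode("utf-8")`; later stages then decode UTF-8 and work on Unicode strings.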
The word vector training module 2 based on a language model comprises a positive and negative example generation module 201, a word vector deep neural network module 202, and a network optimization and training error monitoring module 203. The positive and negative example generation module 201 reads the input sentences and generates positive examples according to the preset window, and generates the corresponding negative examples by randomly replacing the center word of each positive example;
the word vector deep neural network module 202 feeds the generated positive and negative examples into the network, computes their probabilities, and adjusts the network according to the probabilities of the positive and negative examples;
the network optimization and training error monitoring module 203 optimizes the language model probability globally and monitors the error during training; when the configured termination condition is reached, it stops training and saves the model.
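The behavior of the positive and negative example generation module 201 can be sketched as below. The function names are hypothetical, and the sketch assumes that only full windows are used (no padding) and that the replacement character is drawn uniformly from the rest of the character table; the source does not specify either detail.

```python
import random


def positive_windows(chars, window=5):
    """Slide a fixed-size window over the sentence; every window that occurs
    naturally in the text is a positive example."""
    return [list(chars[i:i + window]) for i in range(len(chars) - window + 1)]


def corrupt_center(win, vocab, rng):
    """Build a negative example by replacing the center element of a positive
    window with a different, randomly chosen element of the vocabulary."""
    center = len(win) // 2
    candidates = [c for c in vocab if c != win[center]]
    negative = list(win)
    negative[center] = rng.choice(candidates)
    return negative
```

Passing a seeded `random.Random` instance as `rng` makes the sampled negative examples reproducible between training runs.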
The massive medical knowledge base reinforcement learning module 3 comprises a knowledge base standardization module 301, a reinforcement learning deep neural network module 302, and a network optimization and error monitoring module 303. The knowledge base standardization module 301 standardizes the representation of the entities in the knowledge base;
the reinforcement learning deep neural network module 302 takes the entities in the knowledge base as input and the primary word vectors as features, makes predictions in the reinforcement learning network, and strengthens the primary word vectors according to how the predicted values compare with the true values in the knowledge base;
the network optimization and error monitoring module 303 optimizes the language model probability globally and monitors the error during training; when the configured termination condition is reached, it stops training and saves the model.
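One way to picture how module 302 could refine the primary word vectors is the sketch below. It rests on assumptions the source does not state: each knowledge base entry is taken to carry a category label (for example disease or drug), the entry is represented by the mean of its character vectors, the predictor is a single softmax layer, and the update is an ordinary supervised gradient step driven by the difference between the prediction and the knowledge base value. All names and sizes are hypothetical.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a vector of logits."""
    e = np.exp(z - z.max())
    return e / e.sum()


def kb_step(E, W, char_ids, label, lr=0.1):
    """One update on a knowledge base entry.

    E: (vocab, dim) word vectors, initialized from the primary word vectors.
    W: (n_labels, dim) weights of the assumed softmax predictor.
    Both arrays are updated in place. Returns the cross-entropy loss
    measured before the update.
    """
    ids = np.asarray(char_ids)
    h = E[ids].mean(axis=0)            # feature vector of the entry
    p = softmax(W @ h)                 # predicted label distribution
    loss = -np.log(p[label])
    dz = p.copy()
    dz[label] -= 1.0                   # error signal: prediction minus truth
    grad_W = np.outer(dz, h)
    grad_h = W.T @ dz
    W -= lr * grad_W
    # Propagate the error into the vectors of the characters in the entry.
    np.subtract.at(E, ids, lr * grad_h / len(ids))
    return float(loss)
```

Because the error signal flows back into `E`, characters that occur in entries of the same category are pulled toward similar vectors, which is the sense in which the knowledge base supervises the representation in this sketch.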
The medical named entity recognition module 4 based on a deep artificial neural network comprises a medical named entity deep neural network module 401 and a sentence-level maximum likelihood optimization and overflow control module 402. The medical named entity deep neural network module 401 reads the input sentence, represents it with the distributed semantic features, feeds it into an entity recognition network, and trains a network that recognizes the various medical named entities from a small annotated corpus;
the sentence-level maximum likelihood optimization and overflow control module 402 performs an approximate calculation to handle the overflow errors that occur during training of the deep neural network model.
The sentence-level maximum likelihood optimization and overflow control module 402 uses a maximum likelihood algorithm and prevents model training from failing because of the limited range of computer floating-point numbers. The algorithm is:
First, find the maximum of all inputs x_i: x_max = MAX(x_i);
Then, apply a transformation (the formula itself is missing from the source text),
which avoids floating-point overflow during optimization of the objective function and improves the robustness and precision of the model.
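The transformation formula did not survive in this text, so the following is an assumption rather than the patented algorithm. A standard technique that starts by finding the maximum input is the log-sum-exp identity, which is exact and keeps every exponent at or below zero:

\[ \log \sum_i e^{x_i} = x_{max} + \log \sum_i e^{x_i - x_{max}} \]

A minimal sketch of that identity:

```python
import math


def log_sum_exp(xs):
    """Compute log(sum(exp(x) for x in xs)) without overflow by shifting
    every input by the maximum before exponentiating."""
    x_max = max(xs)
    return x_max + math.log(sum(math.exp(x - x_max) for x in xs))
```

With inputs around 1000, a direct call to `math.exp` overflows the double-precision range, whereas the shifted form only evaluates exponentials of non-positive numbers.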
A named entity recognition algorithm based on a deep neural network is used. As shown in Fig. 2, the deep neural network comprises a convolutional layer, a nonlinear transformation layer based on the HardTanh function, and several linear layers. This structure is widely used for a variety of NLP tasks.
When the classification score of each word is computed, the context words within a window of a specific size around the target word are taken as input. For words near the beginning or end of a sentence, a pseudo padding word is used so that the input vectors of all words have a fixed length. Each word in the input window is mapped to an N-dimensional vector, where N is the word vector dimension. The convolutional layer then generates the global features corresponding to the hidden nodes. Finally, the local and global features are fed together into a standard affine network, which is trained with the back-propagation algorithm. The loss function is defined as a sentence-level log-likelihood (the formula itself is missing from the source text),
in which S(X, T) is the sentence-level likelihood score when the tag sequence T is assigned to the input X; H(T_{t-1}, T_t) is the global transition score from tag T_{t-1} to tag T_t; and DNN(X_t, T_t) is the deep neural network score when tag T_t is assigned to input X_t.
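The loss formula is not reproduced in this text. A common formulation that matches the symbols defined above, given here as an assumption, scores a tag sequence as S(X, T) = sum over t of [H(T_{t-1}, T_t) + DNN(X_t, T_t)] and takes the log-likelihood to be S(X, T) minus the log-sum-exp of S(X, T') over all tag sequences T'. The sketch below computes that quantity with the forward recursion; `emit`, `trans`, and `start` are hypothetical arrays standing for DNN(X_t, T_t), H(T_{t-1}, T_t), and the transition score into the first tag.

```python
import numpy as np


def log_sum_exp(v, axis=None):
    """Overflow-safe log(sum(exp(v))) along the given axis."""
    m = np.max(v, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(v - m), axis=axis))


def path_score(emit, trans, start, tags):
    """S(X, T): sum of transition and network scores along one tag sequence."""
    s = start[tags[0]] + emit[0, tags[0]]
    for t in range(1, len(tags)):
        s += trans[tags[t - 1], tags[t]] + emit[t, tags[t]]
    return s


def log_partition(emit, trans, start):
    """log of the sum over all tag sequences of exp(S(X, T')), computed with
    the forward recursion instead of explicit enumeration."""
    alpha = start + emit[0]
    for t in range(1, emit.shape[0]):
        # alpha[i] + trans[i, j], combined over the previous tag i.
        alpha = log_sum_exp(alpha[:, None] + trans, axis=0) + emit[t]
    return log_sum_exp(alpha)


def sentence_log_likelihood(emit, trans, start, tags):
    """Sentence-level log-likelihood of the tag sequence under this model."""
    return path_score(emit, trans, start, tags) - log_partition(emit, trans, start)
```

Training would minimize the negative of `sentence_log_likelihood` over the annotated sentences; the forward recursion keeps the cost linear in sentence length, and the shifted log-sum-exp is where overflow control enters.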
The medical information extraction method of the present invention, based on deep learning and distributed semantic features, includes the following steps:
generating negative examples by randomly replacing the center word of an input positive example;
training primary word vectors with a deep neural network optimized for the language model;
performing deep reinforcement learning with medical knowledge base big data to obtain a distributed semantic representation for the medical domain;
performing Chinese medical named entity recognition with a deep neural network that optimizes the sentence-level maximum likelihood estimate;
applying an approximation algorithm that effectively prevents overflow in the deep neural network model;
integrating a massive Chinese medical knowledge base into the unsupervised learning process through deep reinforcement learning.
In this method, the preprocessing module 1 denoises the medical big data, unifies its encoding, and generates the character table. The word vector training module 2 based on a language model reads the medical text and, using a predefined window length, divides each input sentence into positive examples of several input windows; it generates the corresponding negative examples by randomly replacing the center word. The positive and negative examples pass through repeated cycles of network probability prediction and network parameter adjustment in a word vector training artificial neural network, and the primary word vectors are finally trained by maximizing the language model. The massive medical knowledge base reinforcement learning module 3 is initialized with the primary word vectors and uses them to predict the entries in the massive knowledge base; through continued reinforcement learning it adjusts the primary word vectors and finally obtains a distributed semantic feature representation for the medical domain. The medical named entity recognition module 4 based on a deep artificial neural network reads a small, newly manually annotated corpus, converts the input sentences into a distributed feature description using the distributed semantic features, and predicts the annotation of each entry; by continually adjusting the network coefficients, it realizes medical named entity recognition based on deep learning and distributed semantic features.
In the language-model-based word vector training module 2, the positive and negative example generation
module 201 generates negative examples by randomly replacing the center word of each positive example; the
word vector deep neural network module 202 learns the primary word vectors from the positive and negative
examples; and the network optimization and training error monitoring module 203 optimizes the model,
monitors the network training error and determines whether the training termination condition is met.
In the massive medical knowledge base reinforcement learning module 3, the knowledge base standardization
module 301 reads the medical knowledge base entries and standardizes the knowledge base descriptions; the
reinforcement learning deep neural network module 302 reads the standardized entries, generates an error
signal by comparing the network prediction with the true knowledge base label, and, through reinforcement
learning, trains the primary word vectors into distributed semantic features oriented to the medical domain.
In the deep-neural-network-based medical named entity recognition module 4, the medical named entity deep
neural network module 401, together with the sentence-level maximum likelihood optimization and overflow
control module 402, uses a small manually annotated corpus to train a network that can accurately identify
medical named entities, while effectively controlling overflow during model training.
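The description names sentence-level maximum likelihood and overflow control but does not give their
formulas. The sketch below is therefore an assumption: it models a sentence as a chain with per-position
label scores and label-to-label transition scores, and it keeps the computation in log space with the
standard log-sum-exp trick so that large scores do not overflow. It is not necessarily the approximation
algorithm of the invention, and all names are hypothetical.

```python
import math

def log_sum_exp(values):
    """Stable log(sum(exp(v))): subtracting the maximum keeps exp() from overflowing."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))

def sentence_log_likelihood(emissions, transitions, start, tags):
    """Log-probability of one label path under an assumed chain model.

    emissions[t][k]   : network score for label k at position t
    transitions[j][k] : score for moving from label j to label k
    start[k]          : score for starting in label k
    tags              : gold label index for each position
    """
    n_tags = len(start)
    # Score of the gold label path.
    gold = start[tags[0]] + emissions[0][tags[0]]
    for t in range(1, len(tags)):
        gold += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    # Log of the summed scores of all paths, by the forward recursion in log space.
    alpha = [start[k] + emissions[0][k] for k in range(n_tags)]
    for t in range(1, len(emissions)):
        alpha = [
            emissions[t][k]
            + log_sum_exp([alpha[j] + transitions[j][k] for j in range(n_tags)])
            for k in range(n_tags)
        ]
    return gold - log_sum_exp(alpha)
```

Training would maximize this quantity over the annotated sentences; computing `exp()` on the raw scores
instead would overflow as soon as a score exceeds roughly 709 in double precision.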
As a highly specialized field, the medical domain has knowledge bases that are highly standardized and very
broad in coverage. An innovative two-step training method is therefore developed. In the first step,
intermediate word vectors are obtained by optimizing the language model probability. In the second step, a
neural network is designed that starts from the word vectors of the first step and trains them further by
optimizing against existing medical knowledge bases. The second training step uses large-scale medical
knowledge bases as supervision to further optimize the medical semantic representation of the word vectors,
greatly improving the ability of the word vector matrix to express medical semantics, so that the resulting
word vectors describe medical knowledge more accurately. This is the key point in which the medical word
vector technique differs from other general-purpose word vector techniques.
Chinese medical knowledge is an important resource for correctly guiding the word vectors. Several
general-purpose medical knowledge bases of the current medical field are collated, such as the Chinese
Pharmacopoeia, which contains information on common drugs, the Chinese diagnostic term set ICD10, and the
Chinese edition of the medical diagnostic term dictionary LOINC. By collating existing medical terminology
libraries, a basic terminology base containing widely used medical terms is obtained.
Because Chinese medical research started relatively late, Chinese medical knowledge bases are relatively
limited. Thirty medical knowledge bases that are widely used abroad are therefore collated, yielding more
than 2 million relevant medical term entries, and the English medical terms are translated into Chinese with
the help of several domain experts.
One problem with existing medical terminology knowledge bases is insufficient coverage. Related research in
the medical field shows that existing medical knowledge bases cover only about 60% of the commonly used
terms of the medical domain. Because of the time lag, many new terms and much new knowledge cannot be
updated into the terminology bases. A medical information extraction system is therefore developed to
extract a large number of clinically widely used medical terms from large-scale Chinese medical text. With
the aid of computer algorithms, the extracted medical terms are screened, corrected and merged with the
existing knowledge bases. Finally, a comprehensive medical domain knowledge base is built that is based on
the existing Chinese medical knowledge bases, is supplemented by a variety of internationally used medical
terminology libraries, and adds medical terms that are common in clinical practice but not yet included.
In the medical-knowledge-guided word vector optimization method, a comprehensive Chinese medical domain
knowledge base containing more than 3 million entries is collected and collated. The knowledge base covers
the common terms of the medical domain, including: drug names, disease names, test results, surgical
procedures, treatment means and adverse reactions. A deep neural network is designed that uses the knowledge
base to perform directed optimization of the word vectors trained in the previous stage.
The input to the network is the word vector corresponding to a medical term. The input layer reads the word
vectors trained in the previous stage by optimizing the language model and uses them as the input vector of
the medical term. For each term, the neural network computes the probability that the term belongs to each
medical category (the 6 categories listed above), and the word vectors are then directionally optimized by
optimizing the predicted probability of the term's category. The structure of the neural network is as
follows:
1) Input layer: converts the input medical term into an input vector using the existing word vectors;
2) Convolutional layer: transforms the input vector by convolution and maps it to a fixed-length
intermediate layer (300 hidden nodes);
3) Linear transformation layer: maps the intermediate layer produced by the convolution to the first hidden
layer (500 hidden nodes);
4) Nonlinear transformation layer: applies the HardTanh function to its input and maps it to the second
hidden layer (500 hidden nodes);
5) Linear transformation layer: maps the output of the second hidden layer to the final output layer
(6 nodes);
The corresponding error signal is computed from the output-layer probabilities and the true category of the
medical term; the back-propagation algorithm then adjusts the parameters of the entire neural network and,
finally, the corresponding word vectors.
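The five layers and the output error signal listed above can be sketched as a forward pass in plain Python.
Several details are assumptions made only so that the example is complete: the convolution kernel width of
3, the max-over-positions step that produces the fixed-length intermediate layer, the softmax output, the
cross-entropy form of the error signal, and the random initialization. The class and function names are
hypothetical.

```python
import math
import random

def hardtanh(x):
    """HardTanh: identity on [-1, 1], clipped outside that range."""
    return max(-1.0, min(1.0, x))

def linear(weights, bias, vec):
    """Dense layer; weights holds one row per output node."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def init(rows, cols, rng, scale=0.05):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

class TermClassifier:
    def __init__(self, dim=50, kernel=3, conv=300, hidden=500, classes=6, seed=0):
        rng = random.Random(seed)
        self.dim, self.kernel = dim, kernel
        self.w_conv, self.b_conv = init(conv, kernel * dim, rng), [0.0] * conv
        self.w_h1, self.b_h1 = init(hidden, conv, rng), [0.0] * hidden
        self.w_out, self.b_out = init(classes, hidden, rng), [0.0] * classes

    def forward(self, char_vectors):
        # 1) Input layer: one word vector per character of the term.
        pad = [0.0] * self.dim
        half = self.kernel // 2
        padded = [pad] * half + list(char_vectors) + [pad] * half
        # 2) Convolution over each window, then max over positions (assumed)
        #    to obtain a fixed-length intermediate layer.
        maps = []
        for i in range(len(char_vectors)):
            window = [x for vec in padded[i:i + self.kernel] for x in vec]
            maps.append(linear(self.w_conv, self.b_conv, window))
        pooled = [max(column) for column in zip(*maps)]
        # 3) Linear transformation to the first hidden layer.
        h1 = linear(self.w_h1, self.b_h1, pooled)
        # 4) HardTanh nonlinearity gives the second hidden layer.
        h2 = [hardtanh(x) for x in h1]
        # 5) Linear transformation to the 6 output nodes, normalized to probabilities.
        return softmax(linear(self.w_out, self.b_out, h2))

def output_error(probs, true_class):
    """Error signal at the output, assuming a cross-entropy loss: p - one_hot."""
    return [p - (1.0 if k == true_class else 0.0) for k, p in enumerate(probs)]

rng = random.Random(1)
model = TermClassifier()
term = [[rng.uniform(-1, 1) for _ in range(50)] for _ in range(4)]  # stand-in vectors for a 4-character term
probs = model.forward(term)
error = output_error(probs, true_class=2)
```

Back propagation would push `error` through the layers in reverse order and, at the input layer, update the
word vectors themselves, which is how the knowledge base labels reshape the vectors.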
Training method: during model training, one fifth of the unlabeled training corpus is extracted as the
validation set. For parameter selection, the learning rate is set to 0.01, the word vector dimension to 50,
and the number of hidden-layer nodes to 100 (hidden node counts from 50 to 150 were tested; 100 achieved the
best result, and values above 100 gave no significant improvement); the word window is set to 5. All deep
neural network parameters are updated with the stochastic gradient descent algorithm and the back
propagation algorithm. For Chinese medical text, no word segmentation is used; instead, each individual
Chinese character is treated as an independent word for which a word vector is generated.
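The training setup above can be summarized in a short sketch. The hyperparameter values are those stated in
the text; the dictionary keys, the function names, the shuffling before the split and the plain form of the
gradient step are assumptions made for illustration.

```python
import random

# Values taken from the description; the key names are hypothetical.
CONFIG = {
    "learning_rate": 0.01,
    "embedding_dim": 50,
    "hidden_nodes": 100,
    "window": 5,
    "validation_fraction": 0.2,  # one fifth of the corpus
}

def char_tokens(text):
    """No word segmentation: every non-whitespace character is one word."""
    return [ch for ch in text if not ch.isspace()]

def split_corpus(sentences, fraction, seed=0):
    """Hold out the given fraction of the corpus as a validation set."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * fraction))
    return shuffled[n_val:], shuffled[:n_val]

def sgd_step(params, grads, lr):
    """One stochastic gradient descent update on a flat list of parameters."""
    return [p - lr * g for p, g in zip(params, grads)]

corpus = ["sentence %d" % i for i in range(10)]  # placeholder corpus
train, validation = split_corpus(corpus, CONFIG["validation_fraction"])
```

Treating each character as a word keeps the vocabulary small and avoids segmentation errors on medical
terms, at the cost of leaving multi-character terms to be composed by the network.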
The word vectors contain not only syntactic information but also semantic information. After the word
vectors are obtained, the words most similar to each word are computed using cosine similarity. In the
following example, each entry is a single Chinese character given in English translation. The first column
shows the words most similar to "one"; it consists mainly of numerals and quantifiers. The third column
mainly contains medical terms related to human organs.
One | Left | Limb | Larynx
Three | Right | Jaw | Top
Two | Double | Lung | Office
Half | Two | Arm | Nose
0 | On | Wall | Sinus
Two | And | States | Chamber
Number | Have | Noon | Eyelid
Have | Before | Obvious | Gorge
Compared with | Pillow | Neck | Foot
Beautiful | Under | Stern | Tears
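The similarity lookup behind the example above can be sketched as follows. The three-dimensional toy
vectors are invented for illustration and are not the trained 50-dimensional vectors; the function names are
hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(word, vectors, top_k=10):
    """Rank all other words by cosine similarity to the query word."""
    query = vectors[word]
    scored = [
        (cosine_similarity(query, vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:top_k]]

# Toy vectors (assumed): numerals point one way, body parts another.
toy = {
    "one": [0.9, 0.1, 0.0],
    "two": [0.8, 0.2, 0.1],
    "three": [0.85, 0.15, 0.05],
    "lung": [0.0, 0.2, 0.9],
    "limb": [0.1, 0.1, 0.8],
}
neighbours = most_similar("one", toy, top_k=2)
```

With these toy vectors the nearest neighbours of "one" are the other numerals, mirroring how the first
column of the example groups numerals and quantifiers.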
In conclusion the present invention proposes and a kind of identifies medical treatment based on the method for deep learning and distributed semantic feature
6 kinds of important informations in text, including:The information such as drug, detection, disease, surgical procedure, treatment means and adverse reaction.
Compared with conventional conditions random field (CRF) model, the method have the characteristics that:1) using a large amount of un-annotated datas come generate word to
Amount, so as to avoid cumbersome feature selecting and the evolutionary process in medicine natural language processing;2) medical domain is made full use of to show
Some mass knowledge libraries are attached to existing knowledge in deep learning algorithm by strengthening study, so as to effectively improve systematicness
Energy;3) for medicine text using the medicine name entity identification algorithms based on deep-neural-network, in Chinese medical text marking
It is assessed in corpus, achieves performance more higher than traditional method based on sequence labelling.
It should be understood that the above is only a preferred embodiment of the present invention. Those of
ordinary skill in the art may make several improvements and modifications without departing from the
principle of the present invention, and such improvements and modifications shall also be regarded as
falling within the scope of protection of the present invention.