CN110059185A - Medical document specialized vocabulary automatic annotation method - Google Patents
Medical document specialized vocabulary automatic annotation method
- Publication number
- CN110059185A CN201910265223.3A
- Authority
- CN
- China
- Prior art keywords
- word
- medical files
- vector
- feature vector
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
Abstract
The present invention relates to a method for automatically annotating specialized vocabulary in medical documents, comprising: preprocessing an input medical document to obtain preprocessed medical text; obtaining the character-level feature vector, word-level feature vector, and linguistic feature vector of each word and merging them into the word's encoding vector; annotating the words of the segmented medical text by class to obtain an annotated data set; outputting, for each word, a multi-dimensional vector as its spatial representation; obtaining an augmented annotated data set; and training a model that outputs the final annotation results. The present invention is rationally designed: it uses a semi-supervised learning algorithm to annotate large amounts of unlabeled data, successfully overcoming the scarcity of annotated data in the medical industry, effectively increasing the amount of data available to the model, and substantially improving the annotation accuracy of the algorithm for keywords and specialized vocabulary. It can be widely applied to medical literature processing.
Description
Technical field
The present invention belongs to the field of machine learning, and in particular relates to a method for automatically annotating specialized vocabulary in medical documents.
Background art
With the growth of the medical research community, more and more papers are published every year. People increasingly need better methods for finding papers and for automatically understanding the key ideas in them. However, because of the variety of domains and the extremely limited annotation resources, relatively little scientific information has been extracted.
Meanwhile, as demand for medical resources surges, so does the number of medical documents and cases, forcing researchers and medical staff to sort through patients' past medical data quickly. Patient cases often contain specialized vocabulary and keywords that help medical staff make judgments, but sorting these vocabulary items and keywords by hand takes a great deal of time; with limited manpower, it is impossible to quickly finish organizing large numbers of cases and medical records.
In summary, as demand for medical resources rises, automatically annotating specialized vocabulary and keywords, so as to speed up the processing of cases and medical data by medical staff and help them treat cases better, is an urgent problem to be solved.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art by proposing a method for automatically annotating specialized vocabulary in medical documents. The method uses a semi-supervised learning algorithm to expand the data volume, overcoming the poor model performance previously caused by insufficient annotated medical text, and ultimately improving the accuracy of identifying specialized vocabulary and keywords in text.
The present invention solves this technical problem with the following technical solution:
A method for automatically annotating specialized vocabulary in medical documents, comprising the following steps:
Step 1: preprocess the input medical document to obtain preprocessed medical text;
Step 2: model the text with a biLSTM to obtain the character-level feature vector of each word;
Step 3: model the text with word2vec to obtain the word-level feature vector of each word;
Step 4: obtain the linguistic feature vector of each word based on the pragmatic features of the text;
Step 5: merge the character-level, word-level, and linguistic feature vectors obtained in steps 2, 3, and 4 into the word's encoding vector;
Step 6: annotate the words of the segmented medical text as the following four classes of medical entity: disease name, disease symptom, treatment method, and drug name, with each word's specific position within an entity marked using IOBES, to obtain an annotated data set;
Step 7: feed the text from step 1 and the word encoding vectors from step 5 into a biLSTM, which outputs for each word a multi-dimensional vector as its spatial representation;
Step 8: expand the annotated data set with a label propagation algorithm to obtain an augmented annotated data set;
Step 9: using the multi-dimensional vectors of step 7 as the word vectors, feed the augmented annotated data set of step 8 into a conditional random field for training, and output the final annotation results.
Further, step 1 is implemented as follows: the input medical document is first tokenized to form an array storing each word and punctuation mark in the text; stop words are then removed; finally, stemming and lemmatization are applied to obtain the base form of each word, yielding an unannotated word array.
Further, step 2 is implemented as follows: a biLSTM encodes the character-level features of the preprocessed medical text using the first five letters of each word, finally yielding a character-level feature vector of length 5d.
Further, step 3 is implemented as follows: Google's Word2Vec algorithm encodes the word-level features of the preprocessed medical text, finally yielding for each word a word-level feature vector of length d.
Further, step 4 is implemented as follows: based on the pragmatic features of the text, the following manually defined features are computed for the preprocessed medical text: initial capitalization, all-lowercase, all-uppercase, part of speech, and syntactic structure, forming a feature vector of length 21 in which each feature is represented by 0 or 1.
Further, step 5 is implemented as follows: the character-level, word-level, and linguistic feature vectors are concatenated, forming for each word a combined feature vector of length 5d+d+21.
Further, the annotated data set of step 6 uses combined labels comprising 20 classes.
Further, step 7 is implemented as follows: the combined feature vectors formed from the three features of step 5 are used, and all feature vectors of the whole word array are stacked into a training data matrix whose number of rows is the number of words in the word array and whose number of columns is 5d+d+21; using a biLSTM, the hidden states of the forward and backward passes are passed as input to a linear layer, which projects them to the label-type space of size 20 and serves as input to the CRF layer.
Further, step 8 is implemented as follows: first, a graph is constructed from the feature vectors of the words, which serve as graph nodes; the similarity between feature vectors defines their distances and edge weights w_uv, and the number of graph nodes equals the sum of the unlabeled and labeled data; then, a label propagation algorithm minimizes an objective based on Kullback-Leibler divergence, making the label distributions of adjacent nodes as similar as possible, so that the words corresponding to all graph nodes end up annotated, yielding the augmented data set.
Further, step 9 is implemented as follows: using the multi-dimensional spatial representations of the words from step 7 as the word vectors, the biLSTM finally outputs a score matrix P containing, for each word, a probability distribution over the labels; P is fed into the CRF layer to obtain an annotated sequence y; the score φ(y; x, θ) of sequence y is computed, then the probability P_θ(y|x) of y among all annotated sequences; finally, back-propagation maximizes the objective log P_θ(y|x) to complete supervised learning, and the CRF model is output as the final result.
The advantages and positive effects of the present invention are:
1. The present invention divides the keywords in medical literature into four classes: disease name (disease), symptom (symptom), treatment method (treatment-method), and drug name (drug-name), and annotates specialized vocabulary in medical documents or cases with a semi-supervised annotation method. At very low cost in labor and materials, it lets medical staff and scholars quickly understand the content of a text and make better medical decisions or research choices.
2. The present invention uses a semi-supervised learning algorithm to annotate large amounts of unlabeled data, successfully overcoming the scarcity of annotated data in the medical industry, effectively increasing the amount of data available to the model, and substantially improving the annotation accuracy of the algorithm for keywords and specialized vocabulary. It can be widely applied to medical literature processing.
Detailed description of the invention
Fig. 1 is the processing flow diagram of the present invention.
Specific embodiment
Embodiments of the present invention are further described below with reference to the accompanying drawing.
Design concept of the present invention: machine learning algorithms and techniques are used, and specialized vocabulary in medical documents or cases is annotated with a semi-supervised annotation method. The present invention constructs a three-layer hierarchical neural network to annotate the text: (1) the words in the text are vectorized with three kinds of feature extraction: a BiLSTM extracts character-based features, Word2Vec provides word embeddings, and features based on syntactic structure are extracted for each word; (2) a BiLSTM extracts and encodes the contextual information surrounding each word within the same sentence; (3) a CRF annotation layer jointly models words and labels with the CRF objective function and makes the final label decision for each word.
Based on the above design concept, the method of the present invention for automatically annotating specialized vocabulary in medical documents, as shown in Fig. 1, comprises the following steps:
Step 1: preprocess the input medical document to obtain preprocessed medical text.
In this step, the input is a medical document and the output is a word array. The preprocessing method is as follows: the medical document is first tokenized into an array storing each word and punctuation mark in the text; stop words such as "but", "shall", and "by" are then removed; after that, stemming and lemmatization are applied to obtain the base forms of the words. For example, "running", "ran", and "runs" all stem to the word "run". Lemmatization works similarly, reducing any inflected form of a word to a common base form. Preprocessing thus yields an unannotated word array of base forms.
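A minimal sketch of this preprocessing step (illustrative only, not part of the patent: the stop-word list is a tiny hand-picked sample, and `naive_stem` is a crude suffix-stripper standing in for real stemming and lemmatization):

```python
import re

STOP_WORDS = {"but", "shall", "by", "the", "a", "an", "of"}  # tiny illustrative list

def naive_stem(word):
    # Naive suffix stripping, standing in for a real stemmer/lemmatizer.
    for suffix in ("ning", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(document):
    # Tokenize into words and punctuation marks.
    tokens = re.findall(r"[A-Za-z]+|[^\sA-Za-z]", document.lower())
    # Remove stop words, then reduce the remaining tokens to a base form.
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

words = preprocess("The patient shall keep running, but stops by the clinic.")
```

A real implementation would use an established stemmer and lemmatizer (e.g. Porter stemming with a dictionary-based lemmatizer) rather than the toy rules above.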
Step 2: model the text with a BiLSTM to obtain the character-level feature vector of each word.
The input of this step is the preprocessed word array, and the output is a feature vector based on character features, of length 5d.
The present invention encodes character features with a biLSTM, called Character-Based Embedding. The character-level features of a word are generated from the hidden-layer vectors of the BiLSTM's forward and backward passes. The advantage of a character-based embedding layer is that it can handle out-of-vocabulary words and formulas, which are very common in these data; the feature vector length for each character is set to d. The present invention extracts features from the 5-gram head of the word (i.e., the first five letters of the word read left to right, zero-padded if the word has fewer than five letters), so the final feature vector length is 5d.
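The 5-gram head encoding can be sketched as follows (illustrative: fixed random per-letter vectors stand in for the learned biLSTM character features, and d = 3 is an arbitrary choice):

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
CHAR_INDEX = {c: i + 1 for i, c in enumerate(ALPHABET)}  # 0 is reserved for padding

def head_5gram(word):
    # First five letters of the word, left to right; zero-pad shorter words.
    idx = [CHAR_INDEX.get(c, 0) for c in word.lower()[:5]]
    return idx + [0] * (5 - len(idx))

d = 3  # per-letter feature length (the patent leaves d as a parameter)
rng = np.random.default_rng(0)
EMBED = rng.standard_normal((27, d))  # stand-in for learned character features
EMBED[0] = 0.0                        # padding slot contributes zeros

def char_feature(word):
    # Concatenate the five per-letter vectors into one length-5d vector.
    return np.concatenate([EMBED[i] for i in head_5gram(word)])

vec = char_feature("flu")
```

In the invention itself, the per-letter vectors would come from the BiLSTM's forward and backward hidden states rather than a fixed lookup table.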
Step 3: model the text with word2vec to obtain the word-level feature vector of each word.
In this step, the words of a fixed vocabulary (plus an unknown-word marker) are mapped into a vector space. Word2vec pre-training on combined corpora is used for initialization, and the words are encoded with Google's Word2Vec algorithm, called Word Embedding, finally yielding a feature vector of length d for each word.
Step 4: design and obtain the linguistic feature vector of each word based on the pragmatic features of the text.
In this step, the input is the word array of the original text after tokenization only, and the output is a feature vector designed from linguistic features, of length 21. These features are not trained separately; they are manually defined, called Feature Embedding. The defined features include initial capitalization, all-lowercase, all-uppercase, part of speech, and syntactic structure: 21 features in total, forming a feature vector of length 21 in which each feature is represented by 0 or 1, indicating whether the word has the corresponding feature.
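A sketch of such a manually defined feature vector. The patent does not enumerate the 21 features, so the breakdown below (3 capitalization flags, a hypothetical 12-tag part-of-speech one-hot, and 6 placeholder syntactic flags) is an assumption made purely to illustrate the idea:

```python
POS_TAGS = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
            "ADP", "NUM", "CONJ", "PRT", "PUNCT", "X"]  # hypothetical 12-tag set
SYNTAX_FLAGS = 6  # hypothetical syntactic-structure flags, left as zeros here

def linguistic_features(word, pos_tag):
    caps = [
        int(word[:1].isupper()),  # initial capital
        int(word.islower()),      # all lowercase
        int(word.isupper()),      # all uppercase
    ]
    pos = [int(tag == pos_tag) for tag in POS_TAGS]  # one-hot part of speech
    return caps + pos + [0] * SYNTAX_FLAGS           # length 3 + 12 + 6 = 21

vec21 = linguistic_features("Aspirin", "NOUN")
```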
Step 5: merge the character-level, word-level, and linguistic features of each word obtained in steps 2, 3, and 4 into the word's encoding vector.
The inputs of this step are the character-level feature vector, the word-level feature vector, and the linguistic feature vector of each word; the three feature vectors are concatenated into a combined feature vector of length 5d+d+21 for each word.
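The fusion is plain concatenation; with an illustrative d = 3, the combined vector has length 5d + d + 21 = 39 (zero vectors stand in for actual feature values):

```python
import numpy as np

d = 3  # illustrative embedding size
char_vec = np.zeros(5 * d)  # character-level features, length 5d
word_vec = np.zeros(d)      # word-level (word2vec) features, length d
ling_vec = np.zeros(21)     # manually defined linguistic features, length 21

encoding = np.concatenate([char_vec, word_vec, ling_vec])  # length 5d + d + 21
```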
Step 6: annotate the data: the words of the segmented electronic medical record text are annotated as four classes of medical entity (disease, symptom, treatment method, drug name); within each entity, IOBES marks the word's specific position, giving 20 classes in total and yielding the annotated data set.
To distinguish the spans of two consecutive key phrases of the same type, the present invention assigns each word in a sentence a label specifying its position and type within a phrase. On top of the preprocessed data, each word is given a label indicating the position and class of the phrase containing it. The present invention uniformly uses IOBES (Inside, Outside, Beginning, End, and Singleton) to describe a word's position within a specialized phrase or vocabulary item: I means the word is inside a phrase, B at the beginning of a phrase, E at the end of a phrase, S that the word is an individual specialized vocabulary item by itself, and O that the word lies outside any phrase in the sentence. The present invention combines these position tags with the phrase classes, namely disease name (disease), symptom (symptom), treatment method (treatment-method), and drug name (drug-name), to form complete labels; for example, "critically ill patients" is annotated as "B-symptom I-symptom E-symptom". The combined labels thus formed comprise 20 classes in total. Because the training set is very large, the present invention annotates only a portion of the data, so the data set contains both annotated and unannotated data.
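The IOBES scheme can be sketched as a small helper that expands a typed phrase into per-word labels (an illustrative function, not the patent's annotation tool); it reproduces the patent's example, a three-word symptom phrase labeled B-symptom I-symptom E-symptom:

```python
def iobes_tags(n_words, label):
    # Expand a phrase of n_words words of class `label` into IOBES tags:
    # a single word is a Singleton; longer phrases get B, then I's, then E.
    if n_words == 1:
        return [f"S-{label}"]
    return [f"B-{label}"] + [f"I-{label}"] * (n_words - 2) + [f"E-{label}"]

# The patent's example: "critically ill patients" as a symptom phrase.
tags = iobes_tags(3, "symptom")
```

Words outside any phrase would simply receive the tag "O".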
Step 7: feed the preprocessed text from step 1 and the word encoding vectors from step 5 into a biLSTM with output size 20, which outputs for each word a 20-dimensional vector as its spatial representation.
In this step, the combined feature vectors formed from the three features of step 5 are used, and all feature vectors of the whole word array are stacked into a training data matrix; the number of rows of the matrix is the number of words in the word array, and the number of columns is 5d+d+21. The present invention uses a biLSTM whose forward and backward hidden states are passed as input to a linear layer; the linear layer projects them to the label-type space of size 20 and serves as input to the CRF layer.
Step 8: augment the annotated data set obtained in step 6, expanding it with a label propagation algorithm to obtain an augmented annotated data set.
This step consists of two parts. The first part constructs a graph from the feature vectors of the words, which serve as graph nodes; the similarity between feature vectors defines their distances and edge weights w_uv, and the number of graph nodes equals the sum of the unlabeled and labeled data. The second part applies a label propagation algorithm, which minimizes an objective based on Kullback-Leibler divergence so that the label distributions of adjacent nodes become as similar as possible. This part annotates the words corresponding to all graph nodes, yielding the augmented data set. The specific method is as follows:
The first part constructs the relational graph required by label propagation. Vertices in the graph correspond to the feature vectors of the words, and the distances between word features capture semantic similarity. The total size of the graph equals the labeled data amount V_l plus the unlabeled data amount V_u. Each node is modeled with a pre-trained word embedding (dimension d) together with the 5-gram head of the current word (its first five letters), the embedding of the nearest verb, and one-hot vectors for the part-of-speech tag and capitalization (43 and 4 dimensions respectively). The resulting feature vector of length 5d+d+43+4 is then projected to 100 dimensions with the PCA dimensionality-reduction algorithm. The present invention defines the weight w_uv of the edge between nodes u and v as: w_uv = d_e(u, v) if v ∈ κ(u) or u ∈ κ(v), where κ(u) is the set of k nearest neighbours of u and d_e(u, v) is the Euclidean distance between any two nodes u and v in the graph.
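The graph construction can be sketched with numpy: pairwise Euclidean distances, symmetric k-nearest-neighbour adjacency, and the stated weight rule w_uv = d_e(u, v) if v ∈ κ(u) or u ∈ κ(v). Small random vectors stand in for the 100-dimensional PCA-reduced features:

```python
import numpy as np

def knn_graph_weights(X, k):
    # Pairwise Euclidean distances between node feature vectors.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    n = len(X)
    # kappa(u): indices of the k nearest neighbours of u (excluding u itself).
    order = np.argsort(dist, axis=1)
    knn = [set(order[u][1:k + 1]) for u in range(n)]
    W = np.zeros((n, n))
    for u in range(n):
        for v in range(n):
            if u != v and (v in knn[u] or u in knn[v]):
                W[u, v] = dist[u, v]  # w_uv = d_e(u, v), as stated in the patent
    return W

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 4))  # 6 nodes with 4-dim stand-in features
W = knn_graph_weights(X, k=2)
```

Note the "or" in the neighbour condition makes the adjacency, and hence W, symmetric.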
For each node i in the graph, the present invention computes marginal probabilities {q_i} using the forward and backward passes. Let θ_n denote the estimate of the CRF parameters after the n-th iteration; for each annotated position i in each sentence j of the labeled and unlabeled data, the present invention computes the marginal probability of its IOBES label under the CRF, p(y_i | x_j; θ_n).
The second part augments the data with the label propagation algorithm, annotating the unlabeled data in the data set. The algorithm minimizes an objective based on Kullback-Leibler divergence so that the label distributions of adjacent nodes are as similar as possible, by minimizing the following Kullback-Leibler divergences: (i) for the labeled word nodes in the graph, between the empirical label distribution r_u and the predicted distribution q_u; (ii) for every node u and each of its adjacent nodes v, between the distributions q_u and q_v; (iii) for all nodes, between the distribution q_u and the CRF marginal probabilities. If a node is not connected to any labeled vertex, the third term regularizes the predicted distribution toward the CRF prediction, ensuring the algorithm performs at least as well as standard self-training.
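A minimal label propagation sketch, with simplifying assumptions: labeled nodes are clamped to their empirical distributions r_u, and each unlabeled node's distribution is repeatedly replaced by the weighted average of its neighbours' distributions, a common surrogate for the patent's KL-divergence objective (the CRF-regularization term is omitted):

```python
import numpy as np

def propagate_labels(W, labels, n_classes, n_iter=50):
    # labels[u] = class index for labeled nodes, -1 for unlabeled ones.
    n = len(labels)
    q = np.full((n, n_classes), 1.0 / n_classes)  # start from uniform distributions
    clamped = labels >= 0
    q[clamped] = np.eye(n_classes)[labels[clamped]]  # r_u for labeled nodes
    for _ in range(n_iter):
        deg = np.clip(W.sum(axis=1, keepdims=True), 1e-12, None)
        upd = (W @ q) / deg                 # weighted average of neighbours
        q[~clamped] = upd[~clamped]         # move unlabeled nodes only
    return q

# Path graph 0-1-2-3 with the two endpoints labeled class 0 and class 1.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], float)
labels = np.array([0, -1, -1, 1])
q = propagate_labels(W, labels, n_classes=2)
```

On this toy graph the interior nodes converge toward the nearer endpoint's class, as expected of label propagation.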
Step 9: using the 20-dimensional spatial representations of step 7 as the word vectors, feed the augmented annotated data set of step 8 into a conditional random field for training.
Using the 20-dimensional spatial representations of step 7 as the word vectors, the biLSTM finally outputs a score matrix P that essentially contains a probability distribution over the labels; P is then fed into the CRF layer to obtain an annotated sequence y. The score φ(y; x, θ) of sequence y is computed, then the probability P_θ(y|x) of y among all annotated sequences; finally, back-propagation maximizes the objective log P_θ(y|x) to complete supervised learning, and the CRF model is output as the final result. The specific method is as follows:
Keyword classification is a task with strong dependencies between output labels; for example, I-disease cannot follow B-treatment-method. Therefore, the present invention does not make an independent label decision for each output but models the outputs jointly with a conditional random field. For an input sentence x = (x_1, x_2, x_3, ..., x_n), the present invention takes P to be the score matrix output by the biLSTM network. The size of P is n × m, where n is the number of words in the sentence and m is the number of distinct labels. P_t,i is the score of the i-th label for the t-th word in the sentence. The present invention uses a first-order Markov model and defines a transition matrix T, where T_i,j is the score of transitioning from label i to label j. The present invention also adds y_0 and y_n as start and end identifiers, so the dimension of the T matrix becomes m+2.
For a given possible output y and neural network parameters θ, the present invention defines the score as
φ(y; x, θ) = ∑_{t=1}^{n} (T_{y_{t-1}, y_t} + P_{t, y_t})
The probability of a sequence y is obtained by applying a softmax over all possible label sequences:
P_θ(y|x) = exp(φ(y; x, θ)) / ∑_{y′∈Y} exp(φ(y′; x, θ))
where Y denotes all possible label sequences. The normalization term can be computed efficiently with the forward algorithm.
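The sequence score φ(y; x, θ) and the normalization term can be sketched with numpy. The forward algorithm computes log ∑_{y′} exp(φ(y′; x, θ)) in O(n·m²) time, and a brute-force sum over all mⁿ sequences confirms it at a toy size (the start/end identifiers y_0 and y_n are omitted for brevity):

```python
import numpy as np
from itertools import product

def score(P, T, y):
    # phi(y; x, theta): emission scores plus first-order transition scores.
    s = P[np.arange(len(y)), y].sum()
    s += sum(T[y[t - 1], y[t]] for t in range(1, len(y)))
    return s

def log_partition(P, T):
    # Forward algorithm: log-sum-exp of score over all label sequences.
    alpha = P[0].copy()
    for t in range(1, len(P)):
        mat = alpha[:, None] + T + P[t][None, :]  # prev label x current label
        mx = mat.max(axis=0)
        alpha = mx + np.log(np.exp(mat - mx).sum(axis=0))
    mx = alpha.max()
    return mx + np.log(np.exp(alpha - mx).sum())

rng = np.random.default_rng(0)
n, m = 4, 3                      # 4 words, 3 labels (illustrative sizes)
P = rng.standard_normal((n, m))  # biLSTM emission score matrix
T = rng.standard_normal((m, m))  # transition score matrix

# Brute force over all m**n sequences for comparison.
brute = np.log(sum(np.exp(score(P, T, list(y))) for y in product(range(m), repeat=n)))
```

With the log-partition in hand, P_θ(y|x) = exp(φ(y; x, θ) − log_partition(P, T)) for any sequence y.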
Finally, initial training is performed with the labeled data in the data set. During training, the present invention maximizes the log-probability L(Y; X, θ) of the correct label sequences of the given corpus {X, Y}, and completes back-propagation with gradients computed from the total score of each sentence.
After the trained CRF algorithm is obtained, it is combined with the previously constructed feature-extraction components, and annotation of text can begin: inputting a sentence x = (x_1, x_2, x_3, ..., x_n) yields an annotated sequence y = (y_1, y_2, y_3, ..., y_n).
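At annotation time, the highest-scoring label sequence is usually recovered with Viterbi decoding (the patent does not name the decoding procedure, so this choice is an assumption); a brute-force search over the toy-sized label space confirms the decoded sequence is optimal:

```python
import numpy as np
from itertools import product

def viterbi(P, T):
    # Highest-scoring label sequence under emission scores P (n x m)
    # and transition scores T (m x m).
    n, m = P.shape
    delta = P[0].copy()                 # best score ending in each label
    back = np.zeros((n, m), dtype=int)  # backpointers
    for t in range(1, n):
        cand = delta[:, None] + T + P[t][None, :]
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0)
    y = [int(delta.argmax())]
    for t in range(n - 1, 0, -1):
        y.append(int(back[t, y[-1]]))
    return y[::-1]

rng = np.random.default_rng(0)
P = rng.standard_normal((5, 4))  # 5 words, 4 labels (illustrative)
T = rng.standard_normal((4, 4))
y_hat = viterbi(P, T)

# Brute-force check: the decoded sequence matches the best of all 4**5 sequences.
def seq_score(y):
    return P[np.arange(5), list(y)].sum() + sum(T[y[t - 1], y[t]] for t in range(1, 5))

best_score = max(seq_score(y) for y in product(range(4), repeat=5))
y_score = seq_score(y_hat)
```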
Aspects not described in the present invention follow the prior art.
It should be emphasized that the embodiments of the present invention are illustrative rather than restrictive; the present invention therefore includes, but is not limited to, the embodiments described in the detailed description. Other embodiments derived by those skilled in the art from the technical solution of the present invention likewise fall within the scope of protection of the present invention.
Claims (10)
1. A method for automatically annotating specialized vocabulary in medical documents, characterized by comprising the following steps:
Step 1: preprocess the input medical document to obtain preprocessed medical text;
Step 2: model the text with a biLSTM to obtain the character-level feature vector of each word;
Step 3: model the text with word2vec to obtain the word-level feature vector of each word;
Step 4: obtain the linguistic feature vector of each word based on the pragmatic features of the text;
Step 5: merge the character-level, word-level, and linguistic feature vectors obtained in steps 2, 3, and 4 into the word's encoding vector;
Step 6: annotate the words of the segmented medical text as the following four classes of medical entity: disease name, disease symptom, treatment method, and drug name, with each word's specific position within an entity marked using IOBES, to obtain an annotated data set;
Step 7: feed the text from step 1 and the word encoding vectors from step 5 into a biLSTM, which outputs for each word a multi-dimensional vector as its spatial representation;
Step 8: expand the annotated data set with a label propagation algorithm to obtain an augmented annotated data set;
Step 9: using the multi-dimensional vectors of step 7 as the word vectors, feed the augmented annotated data set of step 8 into a conditional random field for training, and output the final annotation results.
2. The method for automatically annotating specialized vocabulary in medical documents according to claim 1, characterized in that step 1 is implemented as follows: the input medical document is first tokenized to form an array storing each word and punctuation mark in the text; stop words are then removed; finally, stemming and lemmatization are applied to obtain the base form of each word, yielding an unannotated word array.
3. The method for automatically annotating specialized vocabulary in medical documents according to claim 1, characterized in that step 2 is implemented as follows: a biLSTM encodes the character-level features of the preprocessed medical text using the first five letters of each word, finally yielding a character-level feature vector of length 5d.
4. The method for automatically annotating specialized vocabulary in medical documents according to claim 1, characterized in that step 3 is implemented as follows: Google's Word2Vec algorithm encodes the word-level features of the preprocessed medical text, finally yielding for each word a word-level feature vector of length d.
5. The method for automatically annotating specialized vocabulary in medical documents according to claim 1, characterized in that step 4 is implemented as follows: based on the pragmatic features of the text, the following manually defined features are computed for the preprocessed medical text: initial capitalization, all-lowercase, all-uppercase, part of speech, and syntactic structure, forming a feature vector of length 21 in which each feature is represented by 0 or 1.
6. The method for automatically annotating specialized vocabulary in medical documents according to claim 1, characterized in that step 5 is implemented as follows: the character-level, word-level, and linguistic feature vectors are concatenated, forming for each word a combined feature vector of length 5d+d+21.
7. The method for automatically annotating specialized vocabulary in medical documents according to claim 1, characterized in that the annotated data set of step 6 uses combined labels comprising 20 classes.
8. The method for automatically annotating specialized vocabulary in medical documents according to claim 1, characterized in that step 7 is implemented as follows: the combined feature vectors formed from the three features of step 5 are used, and all feature vectors of the whole word array are stacked into a training data matrix whose number of rows is the number of words in the word array and whose number of columns is 5d+d+21; using a biLSTM, the hidden states of the forward and backward passes are passed as input to a linear layer, which projects them to the label-type space of size 20 and serves as input to the CRF layer.
9. The medical document professional vocabulary automatic labeling method according to claim 1, characterized in that step 8 is implemented as follows: first, a graph is constructed from the feature vectors corresponding to the words, which serve as the nodes of the graph; the similarity between feature vectors defines their distance and edge weight w_uv, and the total number of nodes in the graph equals the number of unlabeled data plus the number of labeled data; then, using a label propagation algorithm, an objective function based on the Kullback-Leibler divergence is minimized by optimization so that the label distributions of adjacent nodes become as similar to each other as possible; the words corresponding to all nodes in the graph are thereby labeled, yielding the enhanced data set.
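A toy sketch of the graph construction and propagation described in this claim. Gaussian similarity on synthetic two-cluster data supplies the weights w_uv; the classic iterative propagation scheme (average neighbour distributions, clamp labeled nodes) is used here as a simple stand-in for the claim's Kullback-Leibler objective:

```python
import numpy as np

# Toy sketch of claim 9: nodes are word feature vectors, edge weights w_uv
# come from a Gaussian similarity, and labels spread from a few labeled
# seeds to the unlabeled nodes, producing the "enhanced" labeled set.
rng = np.random.default_rng(2)

# Two well-separated clusters of 10 "words" each; 3 labeled seeds per class.
X = np.vstack([rng.normal(0.0, 1.0, (10, 4)),
               rng.normal(5.0, 1.0, (10, 4))])
seeds = {0: 0, 1: 0, 2: 0, 10: 1, 11: 1, 12: 1}   # node index -> class

dist2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-dist2 / 2.0)                  # w_uv from feature-vector similarity
np.fill_diagonal(W, 0.0)
P = W / W.sum(axis=1, keepdims=True)      # row-stochastic propagation matrix

F = np.full((20, 2), 0.5)                 # label distribution per node
for i, c in seeds.items():
    F[i] = np.eye(2)[c]

for _ in range(100):                      # propagate until (near) convergence
    F = P @ F
    for i, c in seeds.items():            # clamp the labeled nodes
        F[i] = np.eye(2)[c]

pred = F.argmax(axis=1)                   # a label for every node in the graph
```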
10. The medical document professional vocabulary automatic labeling method according to claim 1, characterized in that step 9 is implemented as follows: the multidimensional representation of each word obtained in step 7 serves as the word's vector; the biLSTM finally outputs a labeling matrix P containing the probability distribution over each label, which is fed into the CRF layer to obtain the score φ(y; x, θ) of a label sequence y; the probability Pθ(y|x) of the sequence y among all label sequences is then computed; finally, backpropagation is used to maximize the objective function log Pθ(y|x), completing the supervised learning, and the resulting CRF model is output as the final result.
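The CRF quantities in this claim can be sketched numerically as follows, with 3 tags and random scores standing in for the 20-label emission matrix P: φ(y; x, θ) is the emission-plus-transition score, the partition function Z comes from the forward algorithm, and Pθ(y|x) = exp(φ)/Z; training by backpropagation is not shown.

```python
import numpy as np
from itertools import product

# Sketch of claim 10's CRF layer: score phi(y; x, theta) of a label
# sequence and P_theta(y|x) via the forward algorithm.
rng = np.random.default_rng(3)
T, K = 4, 3                           # sequence length, number of tags
emit = rng.normal(size=(T, K))        # biLSTM/linear emission scores
trans = rng.normal(size=(K, K))       # CRF transition scores

def score(y):
    """phi(y; x, theta): sum of emissions plus transitions along y."""
    s = emit[np.arange(T), y].sum()
    s += sum(trans[y[t], y[t + 1]] for t in range(T - 1))
    return s

def log_partition():
    """log Z via the forward algorithm."""
    alpha = emit[0]
    for t in range(1, T):
        alpha = emit[t] + np.log(np.exp(alpha[:, None] + trans).sum(axis=0))
    return np.log(np.exp(alpha).sum())

def prob(y):
    """P_theta(y | x) = exp(phi(y)) / Z."""
    return np.exp(score(y) - log_partition())

# All K^T sequences' probabilities must sum to 1.
total = sum(prob(np.array(y)) for y in product(range(K), repeat=T))
print(round(total, 6))  # -> 1.0
```

Summing Pθ(y|x) over all 3⁴ = 81 sequences recovers 1, confirming that the forward algorithm computes the correct normalizer.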
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910265223.3A CN110059185B (en) | 2019-04-03 | 2019-04-03 | Medical document professional vocabulary automatic labeling method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110059185A true CN110059185A (en) | 2019-07-26 |
CN110059185B CN110059185B (en) | 2022-10-04 |
Family
ID=67318293
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910265223.3A Active CN110059185B (en) | 2019-04-03 | 2019-04-03 | Medical document professional vocabulary automatic labeling method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110059185B (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8745093B1 (en) * | 2000-09-28 | 2014-06-03 | Intel Corporation | Method and apparatus for extracting entity names and their relations |
CN108491382A (en) * | 2018-03-14 | 2018-09-04 | 四川大学 | A semi-supervised biomedical text semantic disambiguation method |
CN108664473A (en) * | 2018-05-11 | 2018-10-16 | 平安科技(深圳)有限公司 | Text key information recognition method, electronic device and readable storage medium |
CN108829801A (en) * | 2018-06-06 | 2018-11-16 | 大连理工大学 | An event trigger word extraction method based on document-level attention mechanism |
CN108831559A (en) * | 2018-06-20 | 2018-11-16 | 清华大学 | A Chinese electronic health record text analysis method and system |
CN109299262A (en) * | 2018-10-09 | 2019-02-01 | 中山大学 | A text entailment relation recognition method fusing multi-granularity information |
Non-Patent Citations (2)
Title |
---|
Shang Fuhua et al., "A Semi-supervised Automatic Semantic Annotation Method for 3D Models", Computer Engineering and Applications * |
Lin Guanghe et al., "Named Entity Recognition Based on Fine-grained Word Representation", Journal of Chinese Information Processing * |
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111063446A (en) * | 2019-12-17 | 2020-04-24 | 医渡云(北京)技术有限公司 | Method, apparatus, device and storage medium for standardizing medical text data |
TWI797537B (en) * | 2020-01-13 | 2023-04-01 | 加拿大商知識研究有限公司 | Method and system of using hierarchical vectorisation for representation of healthcare data |
CN111666406A (en) * | 2020-04-13 | 2020-09-15 | 天津科技大学 | Short text classification prediction method based on word and label combination of self-attention |
CN111666406B (en) * | 2020-04-13 | 2023-03-31 | 天津科技大学 | Short text classification prediction method based on word and label combination of self-attention |
CN111651991B (en) * | 2020-04-15 | 2022-08-26 | 天津科技大学 | Medical named entity identification method utilizing multi-model fusion strategy |
CN111651991A (en) * | 2020-04-15 | 2020-09-11 | 天津科技大学 | Medical named entity identification method utilizing multi-model fusion strategy |
CN111797612A (en) * | 2020-05-15 | 2020-10-20 | 中国科学院软件研究所 | Method for extracting automatic data function items |
CN111797263A (en) * | 2020-07-08 | 2020-10-20 | 北京字节跳动网络技术有限公司 | Image label generation method, device, equipment and computer readable medium |
CN112101014A (en) * | 2020-08-20 | 2020-12-18 | 淮阴工学院 | Chinese chemical industry document word segmentation method based on mixed feature fusion |
CN113808752A (en) * | 2020-12-04 | 2021-12-17 | 四川医枢科技股份有限公司 | Medical document identification method, device and equipment |
CN113297852B (en) * | 2021-07-26 | 2021-11-12 | 北京惠每云科技有限公司 | Medical entity word recognition method and device |
CN113297852A (en) * | 2021-07-26 | 2021-08-24 | 北京惠每云科技有限公司 | Medical entity word recognition method and device |
CN114386424B (en) * | 2022-03-24 | 2022-06-10 | 上海帜讯信息技术股份有限公司 | Industry professional text automatic labeling method, industry professional text automatic labeling device, industry professional text automatic labeling terminal and industry professional text automatic labeling storage medium |
CN114386424A (en) * | 2022-03-24 | 2022-04-22 | 上海帜讯信息技术股份有限公司 | Industry professional text automatic labeling method, industry professional text automatic labeling device, industry professional text automatic labeling terminal and industry professional text automatic labeling storage medium |
CN115292498A (en) * | 2022-08-19 | 2022-11-04 | 北京华宇九品科技有限公司 | Document classification method, system, computer equipment and storage medium |
CN115563311A (en) * | 2022-10-21 | 2023-01-03 | 中国能源建设集团广东省电力设计研究院有限公司 | Document marking and knowledge base management method and knowledge base management system |
CN115563311B (en) * | 2022-10-21 | 2023-09-15 | 中国能源建设集团广东省电力设计研究院有限公司 | Document labeling and knowledge base management method and knowledge base management system |
CN115858819A (en) * | 2023-01-29 | 2023-03-28 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | Sample data augmentation method and device |
CN115858819B (en) * | 2023-01-29 | 2023-05-16 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | Sample data amplification method and device |
CN117034917A (en) * | 2023-10-08 | 2023-11-10 | 中国医学科学院医学信息研究所 | English text word segmentation method, device and computer readable medium |
CN117034917B (en) * | 2023-10-08 | 2023-12-22 | 中国医学科学院医学信息研究所 | English text word segmentation method, device and computer readable medium |
CN117095782A (en) * | 2023-10-20 | 2023-11-21 | 上海森亿医疗科技有限公司 | Medical text quick input method, system, terminal and editor |
CN117095782B (en) * | 2023-10-20 | 2024-02-06 | 上海森亿医疗科技有限公司 | Medical text quick input method, system, terminal and editor |
Also Published As
Publication number | Publication date |
---|---|
CN110059185B (en) | 2022-10-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110059185A (en) | Medical document professional vocabulary automatic labeling method | |
CN111241294B (en) | Relationship extraction method of graph convolution network based on dependency analysis and keywords | |
CN109753660B (en) | LSTM-based winning bid web page named entity extraction method | |
CN109871538A (en) | A Chinese electronic health record named entity recognition method | |
CN109508459B (en) | Method for extracting theme and key information from news | |
CN111966917A (en) | Event detection and summarization method based on pre-training language model | |
CN111858940B (en) | Multi-head attention-based legal case similarity calculation method and system | |
CN111222318B (en) | Trigger word recognition method based on double-channel bidirectional LSTM-CRF network | |
CN110532328A (en) | A text concept graph construction method | |
CN111400455A (en) | Relation detection method of question-answering system based on knowledge graph | |
CN112966525B (en) | Law field event extraction method based on pre-training model and convolutional neural network algorithm | |
CN110413768A (en) | An automatic article title generation method | |
CN110750646B (en) | Attribute description extracting method for hotel comment text | |
CN110781290A (en) | Extraction method of structured text abstract of long chapter | |
CN112800184B (en) | Short text comment emotion analysis method based on Target-Aspect-Opinion joint extraction | |
CN113360667B (en) | Biomedical trigger word detection and named entity identification method based on multi-task learning | |
CN114818717A (en) | Chinese named entity recognition method and system fusing vocabulary and syntax information | |
CN112101014A (en) | Chinese chemical industry document word segmentation method based on mixed feature fusion | |
CN115545021A (en) | Clinical term identification method and device based on deep learning | |
CN115935995A (en) | Knowledge graph generation-oriented non-genetic-fabric-domain entity relationship extraction method | |
CN116049394A (en) | Long text similarity comparison method based on graph neural network | |
CN110929518A (en) | Text sequence labeling algorithm using overlapping splitting rule | |
CN112800244B (en) | Method for constructing knowledge graph of traditional Chinese medicine and national medicine | |
CN114265936A (en) | Method for realizing text mining of science and technology project | |
CN117236338B (en) | Named entity recognition model of dense entity text and training method thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right |
Effective date of registration: 20240408 Address after: Room 1518B, Unit 2, 12th Floor, Huizhi Building, No. 9 Xueqing Road, Haidian District, Beijing, 100080 Patentee after: Beijing Contention Technology Co., Ltd. Country or region after: China Address before: No. 9, 13th Street, Economic and Technological Development Zone, Binhai New Area, Tianjin Patentee before: Tianjin University of Science and Technology Country or region before: China |
|
TR01 | Transfer of patent right |