CN110059185A - Medical document specialized vocabulary automatic annotation method - Google Patents
Medical document specialized vocabulary automatic annotation method
- Publication number
- CN110059185A CN201910265223.3A
- Authority
- CN
- China
- Prior art keywords
- word
- medical files
- vector
- feature vector
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
Abstract
The present invention relates to a method for automatically annotating specialized vocabulary in medical documents, comprising: preprocessing an input medical document to obtain preprocessed medical text; obtaining the character-level feature vector, word-level feature vector, and linguistic feature vector of each word and merging them into the word's encoding vector; annotating the words of the segmented medical text by class to obtain an annotated data set; outputting, for each word, a multi-dimensional vector as its spatial representation; obtaining an augmented annotated data set; and training a model that outputs the final annotation results. The present invention is rationally designed: it uses a semi-supervised learning algorithm to annotate large amounts of unlabeled data, successfully overcoming the scarcity of annotated data in the medical industry, effectively increasing the amount of data available to the model, and substantially improving the annotation accuracy of the algorithm for keywords and specialized vocabulary. It can be widely applied to medical literature processing.
Description
Technical field
The present invention belongs to the field of machine learning, and in particular relates to a method for automatically annotating specialized vocabulary in medical documents.
Background art
With the growth of the medical research community, more and more papers are published every year. People increasingly need better methods for finding papers and for automatically understanding the key ideas in them. However, because of the variety of domains and the extremely limited annotation resources, relatively little scientific information has been extracted.
Meanwhile, as demand for medical resources surges, so does the number of medical documents and cases, forcing researchers and medical staff to sort through patients' past medical data quickly. Patient cases often contain specialized vocabulary and keywords that help medical staff make judgments, but sorting these vocabulary items and keywords by hand takes a great deal of time; with limited manpower, it is impossible to quickly finish organizing large numbers of cases and medical records.
In summary, as demand for medical resources rises, automatically annotating specialized vocabulary and keywords, so as to speed up the processing of cases and medical data by medical staff and help them treat cases better, is an urgent problem to be solved.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art by proposing a method for automatically annotating specialized vocabulary in medical documents. The method uses a semi-supervised learning algorithm to expand the data volume, overcoming the poor model performance previously caused by insufficient annotated medical text, and ultimately improving the accuracy of identifying specialized vocabulary and keywords in text.
The present invention solves this technical problem with the following technical solution:
A method for automatically annotating specialized vocabulary in medical documents, comprising the following steps:
Step 1: preprocess the input medical document to obtain preprocessed medical text;
Step 2: model the text with a biLSTM to obtain the character-level feature vector of each word;
Step 3: model the text with word2vec to obtain the word-level feature vector of each word;
Step 4: obtain the linguistic feature vector of each word based on the pragmatic features of the text;
Step 5: merge the character-level, word-level, and linguistic feature vectors obtained in steps 2, 3, and 4 into the word's encoding vector;
Step 6: annotate the words of the segmented medical text as the following four classes of medical entity: disease name, disease symptom, treatment method, and drug name, with each word's specific position within an entity marked using IOBES, to obtain an annotated data set;
Step 7: feed the text from step 1 and the word encoding vectors from step 5 into a biLSTM, which outputs for each word a multi-dimensional vector as its spatial representation;
Step 8: expand the annotated data set with a label propagation algorithm to obtain an augmented annotated data set;
Step 9: using the multi-dimensional vectors of step 7 as the word vectors, feed the augmented annotated data set of step 8 into a conditional random field for training, and output the final annotation results.
Further, step 1 is implemented as follows: the input medical document is first tokenized to form an array storing each word and punctuation mark in the text; stop words are then removed; finally, stemming and lemmatization are applied to obtain the base form of each word, yielding an unannotated word array.
Further, step 2 is implemented as follows: a biLSTM encodes the character-level features of the preprocessed medical text using the first five letters of each word, finally yielding a character-level feature vector of length 5d.
Further, step 3 is implemented as follows: Google's Word2Vec algorithm encodes the word-level features of the preprocessed medical text, finally yielding for each word a word-level feature vector of length d.
Further, step 4 is implemented as follows: based on the pragmatic features of the text, the following manually defined features are computed for the preprocessed medical text: initial capitalization, all-lowercase, all-uppercase, part of speech, and syntactic structure, forming a feature vector of length 21 in which each feature is represented by 0 or 1.
Further, step 5 is implemented as follows: the character-level, word-level, and linguistic feature vectors are concatenated, forming for each word a combined feature vector of length 5d+d+21.
Further, the annotated data set of step 6 uses combined labels comprising 20 classes.
Further, step 7 is implemented as follows: the combined feature vectors formed from the three features of step 5 are used, and all feature vectors of the whole word array are stacked into a training data matrix whose number of rows is the number of words in the word array and whose number of columns is 5d+d+21; using a biLSTM, the hidden states of the forward and backward passes are passed as input to a linear layer, which projects them to the label-type space of size 20 and serves as input to the CRF layer.
Further, step 8 is implemented as follows: first, a graph is constructed from the feature vectors of the words, which serve as graph nodes; the similarity between feature vectors defines their distances and edge weights w_uv, and the number of graph nodes equals the sum of the unlabeled and labeled data; then, a label propagation algorithm minimizes an objective based on Kullback-Leibler divergence, making the label distributions of adjacent nodes as similar as possible, so that the words corresponding to all graph nodes end up annotated, yielding the augmented data set.
Further, step 9 is implemented as follows: using the multi-dimensional spatial representations of the words from step 7 as the word vectors, the biLSTM finally outputs a score matrix P containing, for each word, a probability distribution over the labels; P is fed into the CRF layer to obtain an annotated sequence y; the score φ(y; x, θ) of sequence y is computed, then the probability P_θ(y|x) of y among all annotated sequences; finally, back-propagation maximizes the objective log P_θ(y|x) to complete supervised learning, and the CRF model is output as the final result.
The advantages and positive effects of the present invention are:
1. The present invention divides the keywords in medical literature into four classes: disease name (disease), symptom (symptom), treatment method (treatment-method), and drug name (drug-name), and annotates specialized vocabulary in medical documents or cases with a semi-supervised annotation method. At very low cost in labor and materials, it lets medical staff and scholars quickly understand the content of a text and make better medical decisions or research choices.
2. The present invention uses a semi-supervised learning algorithm to annotate large amounts of unlabeled data, successfully overcoming the scarcity of annotated data in the medical industry, effectively increasing the amount of data available to the model, and substantially improving the annotation accuracy of the algorithm for keywords and specialized vocabulary. It can be widely applied to medical literature processing.
Detailed description of the invention
Fig. 1 is the processing flow diagram of the present invention.
Specific embodiment
Embodiments of the present invention are further described below with reference to the accompanying drawing.
Design concept of the present invention: machine learning algorithms and techniques are used, and specialized vocabulary in medical documents or cases is annotated with a semi-supervised annotation method. The present invention constructs a three-layer hierarchical neural network to annotate the text: (1) the words in the text are vectorized with three kinds of feature extraction: a BiLSTM extracts character-based features, Word2Vec provides word embeddings, and features based on syntactic structure are extracted for each word; (2) a BiLSTM extracts and encodes the contextual information surrounding each word within the same sentence; (3) a CRF annotation layer jointly models words and labels with the CRF objective function and makes the final label decision for each word.
Based on the above design concept, the method of the present invention for automatically annotating specialized vocabulary in medical documents, as shown in Fig. 1, comprises the following steps:
Step 1: preprocess the input medical document to obtain preprocessed medical text.
In this step, the input is a medical document and the output is a word array. The preprocessing method is as follows: the medical document is first tokenized into an array storing each word and punctuation mark in the text; stop words such as "but", "shall", and "by" are then removed; after that, stemming and lemmatization are applied to obtain the base forms of the words. For example, "running", "ran", and "runs" all stem to the word "run". Lemmatization works similarly, reducing any inflected form of a word to a common base form. Preprocessing thus yields an unannotated word array of base forms.
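A minimal sketch of this preprocessing step (illustrative only, not part of the patent: the stop-word list is a tiny hand-picked sample, and `naive_stem` is a crude suffix-stripper standing in for real stemming and lemmatization):

```python
import re

STOP_WORDS = {"but", "shall", "by", "the", "a", "an", "of"}  # tiny illustrative list

def naive_stem(word):
    # Naive suffix stripping, standing in for a real stemmer/lemmatizer.
    for suffix in ("ning", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(document):
    # Tokenize into words and punctuation marks.
    tokens = re.findall(r"[A-Za-z]+|[^\sA-Za-z]", document.lower())
    # Remove stop words, then reduce the remaining tokens to a base form.
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

words = preprocess("The patient shall keep running, but stops by the clinic.")
```

A real implementation would use an established stemmer and lemmatizer (e.g. Porter stemming with a dictionary-based lemmatizer) rather than the toy rules above.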
Step 2: model the text with a BiLSTM to obtain the character-level feature vector of each word.
The input of this step is the preprocessed word array, and the output is a feature vector based on character features, of length 5d.
The present invention encodes character features with a biLSTM, called Character-Based Embedding. The character-level features of a word are generated from the hidden-layer vectors of the BiLSTM's forward and backward passes. The advantage of a character-based embedding layer is that it can handle out-of-vocabulary words and formulas, which are very common in these data; the feature vector length for each character is set to d. The present invention extracts features from the 5-gram head of the word (i.e., the first five letters of the word read left to right, zero-padded if the word has fewer than five letters), so the final feature vector length is 5d.
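The 5-gram head encoding can be sketched as follows (illustrative: fixed random per-letter vectors stand in for the learned biLSTM character features, and d = 3 is an arbitrary choice):

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
CHAR_INDEX = {c: i + 1 for i, c in enumerate(ALPHABET)}  # 0 is reserved for padding

def head_5gram(word):
    # First five letters of the word, left to right; zero-pad shorter words.
    idx = [CHAR_INDEX.get(c, 0) for c in word.lower()[:5]]
    return idx + [0] * (5 - len(idx))

d = 3  # per-letter feature length (the patent leaves d as a parameter)
rng = np.random.default_rng(0)
EMBED = rng.standard_normal((27, d))  # stand-in for learned character features
EMBED[0] = 0.0                        # padding slot contributes zeros

def char_feature(word):
    # Concatenate the five per-letter vectors into one length-5d vector.
    return np.concatenate([EMBED[i] for i in head_5gram(word)])

vec = char_feature("flu")
```

In the invention itself, the per-letter vectors would come from the BiLSTM's forward and backward hidden states rather than a fixed lookup table.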
Step 3: model the text with word2vec to obtain the word-level feature vector of each word.
In this step, the words of a fixed vocabulary (plus an unknown-word marker) are mapped into a vector space. Word2vec pre-training on combined corpora is used for initialization, and the words are encoded with Google's Word2Vec algorithm, called Word Embedding, finally yielding a feature vector of length d for each word.
Step 4: design and obtain the linguistic feature vector of each word based on the pragmatic features of the text.
In this step, the input is the word array of the original text after tokenization only, and the output is a feature vector designed from linguistic features, of length 21. These features are not trained separately; they are manually defined, called Feature Embedding. The defined features include initial capitalization, all-lowercase, all-uppercase, part of speech, and syntactic structure: 21 features in total, forming a feature vector of length 21 in which each feature is represented by 0 or 1, indicating whether the word has the corresponding feature.
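A sketch of such a manually defined feature vector. The patent does not enumerate the 21 features, so the breakdown below (3 capitalization flags, a hypothetical 12-tag part-of-speech one-hot, and 6 placeholder syntactic flags) is an assumption made purely to illustrate the idea:

```python
POS_TAGS = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
            "ADP", "NUM", "CONJ", "PRT", "PUNCT", "X"]  # hypothetical 12-tag set
SYNTAX_FLAGS = 6  # hypothetical syntactic-structure flags, left as zeros here

def linguistic_features(word, pos_tag):
    caps = [
        int(word[:1].isupper()),  # initial capital
        int(word.islower()),      # all lowercase
        int(word.isupper()),      # all uppercase
    ]
    pos = [int(tag == pos_tag) for tag in POS_TAGS]  # one-hot part of speech
    return caps + pos + [0] * SYNTAX_FLAGS           # length 3 + 12 + 6 = 21

vec21 = linguistic_features("Aspirin", "NOUN")
```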
Step 5: merge the character-level, word-level, and linguistic features of each word obtained in steps 2, 3, and 4 into the word's encoding vector.
The inputs of this step are the character-level feature vector, the word-level feature vector, and the linguistic feature vector of each word; the three feature vectors are concatenated into a combined feature vector of length 5d+d+21 for each word.
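The fusion is plain concatenation; with an illustrative d = 3, the combined vector has length 5d + d + 21 = 39 (zero vectors stand in for actual feature values):

```python
import numpy as np

d = 3  # illustrative embedding size
char_vec = np.zeros(5 * d)  # character-level features, length 5d
word_vec = np.zeros(d)      # word-level (word2vec) features, length d
ling_vec = np.zeros(21)     # manually defined linguistic features, length 21

encoding = np.concatenate([char_vec, word_vec, ling_vec])  # length 5d + d + 21
```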
Step 6: annotate the data: the words of the segmented electronic medical record text are annotated as four classes of medical entity (disease, symptom, treatment method, drug name); within each entity, IOBES marks the word's specific position, giving 20 classes in total and yielding the annotated data set.
To distinguish the spans of two consecutive key phrases of the same type, the present invention assigns each word in a sentence a label specifying its position and type within a phrase. On top of the preprocessed data, each word is given a label indicating the position and class of the phrase containing it. The present invention uniformly uses IOBES (Inside, Outside, Beginning, End, and Singleton) to describe a word's position within a specialized phrase or vocabulary item: I means the word is inside a phrase, B at the beginning of a phrase, E at the end of a phrase, S that the word is an individual specialized vocabulary item by itself, and O that the word lies outside any phrase in the sentence. The present invention combines these position tags with the phrase classes, namely disease name (disease), symptom (symptom), treatment method (treatment-method), and drug name (drug-name), to form complete labels; for example, "critically ill patients" is annotated as "B-symptom I-symptom E-symptom". The combined labels thus formed comprise 20 classes in total. Because the training set is very large, the present invention annotates only a portion of the data, so the data set contains both annotated and unannotated data.
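The IOBES scheme can be sketched as a small helper that expands a typed phrase into per-word labels (an illustrative function, not the patent's annotation tool); it reproduces the patent's example, a three-word symptom phrase labeled B-symptom I-symptom E-symptom:

```python
def iobes_tags(n_words, label):
    # Expand a phrase of n_words words of class `label` into IOBES tags:
    # a single word is a Singleton; longer phrases get B, then I's, then E.
    if n_words == 1:
        return [f"S-{label}"]
    return [f"B-{label}"] + [f"I-{label}"] * (n_words - 2) + [f"E-{label}"]

# The patent's example: "critically ill patients" as a symptom phrase.
tags = iobes_tags(3, "symptom")
```

Words outside any phrase would simply receive the tag "O".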
Step 7: feed the preprocessed text from step 1 and the word encoding vectors from step 5 into a biLSTM with output size 20, which outputs for each word a 20-dimensional vector as its spatial representation.
In this step, the combined feature vectors formed from the three features of step 5 are used, and all feature vectors of the whole word array are stacked into a training data matrix; the number of rows of the matrix is the number of words in the word array, and the number of columns is 5d+d+21. The present invention uses a biLSTM whose forward and backward hidden states are passed as input to a linear layer; the linear layer projects them to the label-type space of size 20 and serves as input to the CRF layer.
Step 8: augment the annotated data set obtained in step 6, expanding it with a label propagation algorithm to obtain an augmented annotated data set.
This step consists of two parts. The first part constructs a graph from the feature vectors of the words, which serve as graph nodes; the similarity between feature vectors defines their distances and edge weights w_uv, and the number of graph nodes equals the sum of the unlabeled and labeled data. The second part applies a label propagation algorithm, which minimizes an objective based on Kullback-Leibler divergence so that the label distributions of adjacent nodes become as similar as possible. This part annotates the words corresponding to all graph nodes, yielding the augmented data set. The specific method is as follows:
The first part constructs the relational graph required by label propagation. Vertices in the graph correspond to the feature vectors of the words, and the distances between word features capture semantic similarity. The total size of the graph equals the labeled data amount V_l plus the unlabeled data amount V_u. Each node is modeled with a pre-trained word embedding (dimension d) together with the 5-gram head of the current word (its first five letters), the embedding of the nearest verb, and one-hot vectors for the part-of-speech tag and capitalization (43 and 4 dimensions respectively). The resulting feature vector of length 5d+d+43+4 is then projected to 100 dimensions with the PCA dimensionality-reduction algorithm. The present invention defines the weight w_uv of the edge between nodes u and v as: w_uv = d_e(u, v) if v ∈ κ(u) or u ∈ κ(v), where κ(u) is the set of k nearest neighbours of u and d_e(u, v) is the Euclidean distance between any two nodes u and v in the graph.
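The graph construction can be sketched with numpy: pairwise Euclidean distances, symmetric k-nearest-neighbour adjacency, and the stated weight rule w_uv = d_e(u, v) if v ∈ κ(u) or u ∈ κ(v). Small random vectors stand in for the 100-dimensional PCA-reduced features:

```python
import numpy as np

def knn_graph_weights(X, k):
    # Pairwise Euclidean distances between node feature vectors.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    n = len(X)
    # kappa(u): indices of the k nearest neighbours of u (excluding u itself).
    order = np.argsort(dist, axis=1)
    knn = [set(order[u][1:k + 1]) for u in range(n)]
    W = np.zeros((n, n))
    for u in range(n):
        for v in range(n):
            if u != v and (v in knn[u] or u in knn[v]):
                W[u, v] = dist[u, v]  # w_uv = d_e(u, v), as stated in the patent
    return W

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 4))  # 6 nodes with 4-dim stand-in features
W = knn_graph_weights(X, k=2)
```

Note the "or" in the neighbour condition makes the adjacency, and hence W, symmetric.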
For each node i in the graph, the present invention computes marginal probabilities {q_i} using the forward and backward passes. Let θ_n denote the estimate of the CRF parameters after the n-th iteration; for each annotated position i in each sentence j of the labeled and unlabeled data, the present invention computes the marginal probability of its IOBES label under the CRF, p(y_i | x_j; θ_n).
The second part augments the data with the label propagation algorithm, annotating the unlabeled data in the data set. The algorithm minimizes an objective based on Kullback-Leibler divergence so that the label distributions of adjacent nodes are as similar as possible, by minimizing the following Kullback-Leibler divergences: (i) for the labeled word nodes in the graph, between the empirical label distribution r_u and the predicted distribution q_u; (ii) for every node u and each of its adjacent nodes v, between the distributions q_u and q_v; (iii) for all nodes, between the distribution q_u and the CRF marginal probabilities. If a node is not connected to any labeled vertex, the third term regularizes the predicted distribution toward the CRF prediction, ensuring the algorithm performs at least as well as standard self-training.
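A minimal label propagation sketch, with simplifying assumptions: labeled nodes are clamped to their empirical distributions r_u, and each unlabeled node's distribution is repeatedly replaced by the weighted average of its neighbours' distributions, a common surrogate for the patent's KL-divergence objective (the CRF-regularization term is omitted):

```python
import numpy as np

def propagate_labels(W, labels, n_classes, n_iter=50):
    # labels[u] = class index for labeled nodes, -1 for unlabeled ones.
    n = len(labels)
    q = np.full((n, n_classes), 1.0 / n_classes)  # start from uniform distributions
    clamped = labels >= 0
    q[clamped] = np.eye(n_classes)[labels[clamped]]  # r_u for labeled nodes
    for _ in range(n_iter):
        deg = np.clip(W.sum(axis=1, keepdims=True), 1e-12, None)
        upd = (W @ q) / deg                 # weighted average of neighbours
        q[~clamped] = upd[~clamped]         # move unlabeled nodes only
    return q

# Path graph 0-1-2-3 with the two endpoints labeled class 0 and class 1.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], float)
labels = np.array([0, -1, -1, 1])
q = propagate_labels(W, labels, n_classes=2)
```

On this toy graph the interior nodes converge toward the nearer endpoint's class, as expected of label propagation.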
Step 9: using the 20-dimensional spatial representations of step 7 as the word vectors, feed the augmented annotated data set of step 8 into a conditional random field for training.
Using the 20-dimensional spatial representations of step 7 as the word vectors, the biLSTM finally outputs a score matrix P that essentially contains a probability distribution over the labels; P is then fed into the CRF layer to obtain an annotated sequence y. The score φ(y; x, θ) of sequence y is computed, then the probability P_θ(y|x) of y among all annotated sequences; finally, back-propagation maximizes the objective log P_θ(y|x) to complete supervised learning, and the CRF model is output as the final result. The specific method is as follows:
Keyword classification is a task with strong dependencies between output labels; for example, I-disease cannot follow B-treatment-method. Therefore, the present invention does not make an independent label decision for each output but models the outputs jointly with a conditional random field. For an input sentence x = (x_1, x_2, x_3, ..., x_n), the present invention takes P to be the score matrix output by the biLSTM network. The size of P is n × m, where n is the number of words in the sentence and m is the number of distinct labels. P_t,i is the score of the i-th label for the t-th word in the sentence. The present invention uses a first-order Markov model and defines a transition matrix T, where T_i,j is the score of transitioning from label i to label j. The present invention also adds y_0 and y_n as start and end identifiers, so the dimension of the T matrix becomes m+2.
For a given possible output y and neural network parameters θ, the present invention defines the score as
φ(y; x, θ) = ∑_{t=1}^{n} (T_{y_{t-1}, y_t} + P_{t, y_t})
The probability of a sequence y is obtained by applying a softmax over all possible label sequences:
P_θ(y|x) = exp(φ(y; x, θ)) / ∑_{y′∈Y} exp(φ(y′; x, θ))
where Y denotes all possible label sequences. The normalization term can be computed efficiently with the forward algorithm.
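The sequence score φ(y; x, θ) and the normalization term can be sketched with numpy. The forward algorithm computes log ∑_{y′} exp(φ(y′; x, θ)) in O(n·m²) time, and a brute-force sum over all mⁿ sequences confirms it at a toy size (the start/end identifiers y_0 and y_n are omitted for brevity):

```python
import numpy as np
from itertools import product

def score(P, T, y):
    # phi(y; x, theta): emission scores plus first-order transition scores.
    s = P[np.arange(len(y)), y].sum()
    s += sum(T[y[t - 1], y[t]] for t in range(1, len(y)))
    return s

def log_partition(P, T):
    # Forward algorithm: log-sum-exp of score over all label sequences.
    alpha = P[0].copy()
    for t in range(1, len(P)):
        mat = alpha[:, None] + T + P[t][None, :]  # prev label x current label
        mx = mat.max(axis=0)
        alpha = mx + np.log(np.exp(mat - mx).sum(axis=0))
    mx = alpha.max()
    return mx + np.log(np.exp(alpha - mx).sum())

rng = np.random.default_rng(0)
n, m = 4, 3                      # 4 words, 3 labels (illustrative sizes)
P = rng.standard_normal((n, m))  # biLSTM emission score matrix
T = rng.standard_normal((m, m))  # transition score matrix

# Brute force over all m**n sequences for comparison.
brute = np.log(sum(np.exp(score(P, T, list(y))) for y in product(range(m), repeat=n)))
```

With the log-partition in hand, P_θ(y|x) = exp(φ(y; x, θ) − log_partition(P, T)) for any sequence y.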
Finally, initial training is performed with the labeled data in the data set. During training, the present invention maximizes the log-probability L(Y; X, θ) of the correct label sequences of the given corpus {X, Y}, and completes back-propagation with gradients computed from the total score of each sentence.
After the trained CRF algorithm is obtained, it is combined with the previously constructed feature-extraction components, and annotation of text can begin: inputting a sentence x = (x_1, x_2, x_3, ..., x_n) yields an annotated sequence y = (y_1, y_2, y_3, ..., y_n).
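At annotation time, the highest-scoring label sequence is usually recovered with Viterbi decoding (the patent does not name the decoding procedure, so this choice is an assumption); a brute-force search over the toy-sized label space confirms the decoded sequence is optimal:

```python
import numpy as np
from itertools import product

def viterbi(P, T):
    # Highest-scoring label sequence under emission scores P (n x m)
    # and transition scores T (m x m).
    n, m = P.shape
    delta = P[0].copy()                 # best score ending in each label
    back = np.zeros((n, m), dtype=int)  # backpointers
    for t in range(1, n):
        cand = delta[:, None] + T + P[t][None, :]
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0)
    y = [int(delta.argmax())]
    for t in range(n - 1, 0, -1):
        y.append(int(back[t, y[-1]]))
    return y[::-1]

rng = np.random.default_rng(0)
P = rng.standard_normal((5, 4))  # 5 words, 4 labels (illustrative)
T = rng.standard_normal((4, 4))
y_hat = viterbi(P, T)

# Brute-force check: the decoded sequence matches the best of all 4**5 sequences.
def seq_score(y):
    return P[np.arange(5), list(y)].sum() + sum(T[y[t - 1], y[t]] for t in range(1, 5))

best_score = max(seq_score(y) for y in product(range(4), repeat=5))
y_score = seq_score(y_hat)
```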
Aspects not described in the present invention follow the prior art.
It should be emphasized that the embodiments of the present invention are illustrative rather than restrictive; the present invention therefore includes, but is not limited to, the embodiments described in the detailed description. Other embodiments derived by those skilled in the art from the technical solution of the present invention likewise fall within the scope of protection of the present invention.
Claims (10)
1. A method for automatically annotating specialized vocabulary in medical documents, characterized by comprising the following steps:
Step 1: preprocess the input medical document to obtain preprocessed medical text;
Step 2: model the text with a biLSTM to obtain the character-level feature vector of each word;
Step 3: model the text with word2vec to obtain the word-level feature vector of each word;
Step 4: obtain the linguistic feature vector of each word based on the pragmatic features of the text;
Step 5: merge the character-level, word-level, and linguistic feature vectors obtained in steps 2, 3, and 4 into the word's encoding vector;
Step 6: annotate the words of the segmented medical text as the following four classes of medical entity: disease name, disease symptom, treatment method, and drug name, with each word's specific position within an entity marked using IOBES, to obtain an annotated data set;
Step 7: feed the text from step 1 and the word encoding vectors from step 5 into a biLSTM, which outputs for each word a multi-dimensional vector as its spatial representation;
Step 8: expand the annotated data set with a label propagation algorithm to obtain an augmented annotated data set;
Step 9: using the multi-dimensional vectors of step 7 as the word vectors, feed the augmented annotated data set of step 8 into a conditional random field for training, and output the final annotation results.
2. The method for automatically annotating specialized vocabulary in medical documents according to claim 1, characterized in that step 1 is implemented as follows: the input medical document is first tokenized to form an array storing each word and punctuation mark in the text; stop words are then removed; finally, stemming and lemmatization are applied to obtain the base form of each word, yielding an unannotated word array.
3. The method for automatically annotating specialized vocabulary in medical documents according to claim 1, characterized in that step 2 is implemented as follows: a biLSTM encodes the character-level features of the preprocessed medical text using the first five letters of each word, finally yielding a character-level feature vector of length 5d.
4. The method for automatically annotating specialized vocabulary in medical documents according to claim 1, characterized in that step 3 is implemented as follows: Google's Word2Vec algorithm encodes the word-level features of the preprocessed medical text, finally yielding for each word a word-level feature vector of length d.
5. The method for automatically annotating specialized vocabulary in medical documents according to claim 1, characterized in that step 4 is implemented as follows: based on the pragmatic features of the text, the following manually defined features are computed for the preprocessed medical text: initial capitalization, all-lowercase, all-uppercase, part of speech, and syntactic structure, forming a feature vector of length 21 in which each feature is represented by 0 or 1.
6. The method for automatically annotating specialized vocabulary in medical documents according to claim 1, characterized in that step 5 is implemented as follows: the character-level, word-level, and linguistic feature vectors are concatenated, forming for each word a combined feature vector of length 5d+d+21.
7. The method for automatically annotating specialized vocabulary in medical documents according to claim 1, characterized in that the annotated data set of step 6 uses combined labels comprising 20 classes.
8. The method for automatically annotating specialized vocabulary in medical documents according to claim 1, characterized in that step 7 is implemented as follows: the combined feature vectors formed from the three features of step 5 are used, and all feature vectors of the whole word array are stacked into a training data matrix whose number of rows is the number of words in the word array and whose number of columns is 5d+d+21; using a biLSTM, the hidden states of the forward and backward passes are passed as input to a linear layer, which projects them to the label-type space of size 20 and serves as input to the CRF layer.
9. The medical document professional vocabulary automatic labeling method according to claim 1, characterized in that step 8 is implemented as follows: first, a graph is constructed from the feature vectors corresponding to the words, which serve as the nodes of the graph; the similarity between feature vectors defines their distance and edge weight w_uv, and the total number of nodes in the graph equals the number of unlabeled data plus the number of labeled data; then, using a label propagation algorithm, an objective function based on the Kullback-Leibler divergence is minimized by optimization so that the label distributions of adjacent nodes become as similar to each other as possible; the words corresponding to all nodes in the graph are thereby labeled, yielding the enhanced data set.
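A toy sketch of the graph construction and propagation described in this claim. Gaussian similarity on synthetic two-cluster data supplies the weights w_uv; the classic iterative propagation scheme (average neighbour distributions, clamp labeled nodes) is used here as a simple stand-in for the claim's Kullback-Leibler objective:

```python
import numpy as np

# Toy sketch of claim 9: nodes are word feature vectors, edge weights w_uv
# come from a Gaussian similarity, and labels spread from a few labeled
# seeds to the unlabeled nodes, producing the "enhanced" labeled set.
rng = np.random.default_rng(2)

# Two well-separated clusters of 10 "words" each; 3 labeled seeds per class.
X = np.vstack([rng.normal(0.0, 1.0, (10, 4)),
               rng.normal(5.0, 1.0, (10, 4))])
seeds = {0: 0, 1: 0, 2: 0, 10: 1, 11: 1, 12: 1}   # node index -> class

dist2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-dist2 / 2.0)                  # w_uv from feature-vector similarity
np.fill_diagonal(W, 0.0)
P = W / W.sum(axis=1, keepdims=True)      # row-stochastic propagation matrix

F = np.full((20, 2), 0.5)                 # label distribution per node
for i, c in seeds.items():
    F[i] = np.eye(2)[c]

for _ in range(100):                      # propagate until (near) convergence
    F = P @ F
    for i, c in seeds.items():            # clamp the labeled nodes
        F[i] = np.eye(2)[c]

pred = F.argmax(axis=1)                   # a label for every node in the graph
```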
10. The medical document professional vocabulary automatic labeling method according to claim 1, characterized in that step 9 is implemented as follows: the multidimensional representation of each word obtained in step 7 serves as the word's vector; the biLSTM finally outputs a labeling matrix P containing the probability distribution over each label, which is fed into the CRF layer to obtain the score φ(y; x, θ) of a label sequence y; the probability Pθ(y|x) of the sequence y among all label sequences is then computed; finally, backpropagation is used to maximize the objective function log Pθ(y|x), completing the supervised learning, and the resulting CRF model is output as the final result.
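The CRF quantities in this claim can be sketched numerically as follows, with 3 tags and random scores standing in for the 20-label emission matrix P: φ(y; x, θ) is the emission-plus-transition score, the partition function Z comes from the forward algorithm, and Pθ(y|x) = exp(φ)/Z; training by backpropagation is not shown.

```python
import numpy as np
from itertools import product

# Sketch of claim 10's CRF layer: score phi(y; x, theta) of a label
# sequence and P_theta(y|x) via the forward algorithm.
rng = np.random.default_rng(3)
T, K = 4, 3                           # sequence length, number of tags
emit = rng.normal(size=(T, K))        # biLSTM/linear emission scores
trans = rng.normal(size=(K, K))       # CRF transition scores

def score(y):
    """phi(y; x, theta): sum of emissions plus transitions along y."""
    s = emit[np.arange(T), y].sum()
    s += sum(trans[y[t], y[t + 1]] for t in range(T - 1))
    return s

def log_partition():
    """log Z via the forward algorithm."""
    alpha = emit[0]
    for t in range(1, T):
        alpha = emit[t] + np.log(np.exp(alpha[:, None] + trans).sum(axis=0))
    return np.log(np.exp(alpha).sum())

def prob(y):
    """P_theta(y | x) = exp(phi(y)) / Z."""
    return np.exp(score(y) - log_partition())

# All K^T sequences' probabilities must sum to 1.
total = sum(prob(np.array(y)) for y in product(range(K), repeat=T))
print(round(total, 6))  # -> 1.0
```

Summing Pθ(y|x) over all 3⁴ = 81 sequences recovers 1, confirming that the forward algorithm computes the correct normalizer.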
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910265223.3A CN110059185B (en) | 2019-04-03 | 2019-04-03 | Medical document professional vocabulary automatic labeling method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110059185A true CN110059185A (en) | 2019-07-26 |
CN110059185B CN110059185B (en) | 2022-10-04 |
Family
ID=67318293
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910265223.3A Active CN110059185B (en) | 2019-04-03 | 2019-04-03 | Medical document professional vocabulary automatic labeling method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110059185B (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8745093B1 (en) * | 2000-09-28 | 2014-06-03 | Intel Corporation | Method and apparatus for extracting entity names and their relations |
CN108491382A (en) * | 2018-03-14 | 2018-09-04 | 四川大学 | A semi-supervised biomedical text semantic disambiguation method |
CN108664473A (en) * | 2018-05-11 | 2018-10-16 | 平安科技(深圳)有限公司 | Text key information recognition method, electronic device and readable storage medium |
CN108829801A (en) * | 2018-06-06 | 2018-11-16 | 大连理工大学 | An event trigger word extraction method based on document-level attention mechanism |
CN108831559A (en) * | 2018-06-20 | 2018-11-16 | 清华大学 | A Chinese electronic health record text analysis method and system |
CN109299262A (en) * | 2018-10-09 | 2019-02-01 | 中山大学 | A text entailment relation recognition method fusing multi-granularity information |
Non-Patent Citations (2)
Title |
---|
Shang Fuhua et al., "A Semi-supervised Automatic Semantic Annotation Method for 3D Models", Computer Engineering and Applications * |
Lin Guanghe et al., "Named Entity Recognition Based on Fine-grained Word Representation", Journal of Chinese Information Processing * |
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111063446A (en) * | 2019-12-17 | 2020-04-24 | 医渡云(北京)技术有限公司 | Method, apparatus, device and storage medium for standardizing medical text data |
TWI797537B (en) * | 2020-01-13 | 2023-04-01 | 加拿大商知識研究有限公司 | Method and system of using hierarchical vectorisation for representation of healthcare data |
CN111666406A (en) * | 2020-04-13 | 2020-09-15 | 天津科技大学 | Short text classification prediction method based on word and label combination of self-attention |
CN111666406B (en) * | 2020-04-13 | 2023-03-31 | 天津科技大学 | Short text classification prediction method based on word and label combination of self-attention |
CN111651991B (en) * | 2020-04-15 | 2022-08-26 | 天津科技大学 | Medical named entity identification method utilizing multi-model fusion strategy |
CN111651991A (en) * | 2020-04-15 | 2020-09-11 | 天津科技大学 | Medical named entity identification method utilizing multi-model fusion strategy |
CN111797612A (en) * | 2020-05-15 | 2020-10-20 | 中国科学院软件研究所 | Method for extracting automatic data function items |
CN111797263A (en) * | 2020-07-08 | 2020-10-20 | 北京字节跳动网络技术有限公司 | Image label generation method, device, equipment and computer readable medium |
CN112101014A (en) * | 2020-08-20 | 2020-12-18 | 淮阴工学院 | Chinese chemical industry document word segmentation method based on mixed feature fusion |
CN113808752A (en) * | 2020-12-04 | 2021-12-17 | 四川医枢科技股份有限公司 | Medical document identification method, device and equipment |
CN113297852B (en) * | 2021-07-26 | 2021-11-12 | 北京惠每云科技有限公司 | Medical entity word recognition method and device |
CN113297852A (en) * | 2021-07-26 | 2021-08-24 | 北京惠每云科技有限公司 | Medical entity word recognition method and device |
CN114386424B (en) * | 2022-03-24 | 2022-06-10 | 上海帜讯信息技术股份有限公司 | Industry professional text automatic labeling method, industry professional text automatic labeling device, industry professional text automatic labeling terminal and industry professional text automatic labeling storage medium |
CN114386424A (en) * | 2022-03-24 | 2022-04-22 | 上海帜讯信息技术股份有限公司 | Industry professional text automatic labeling method, industry professional text automatic labeling device, industry professional text automatic labeling terminal and industry professional text automatic labeling storage medium |
CN115292498A (en) * | 2022-08-19 | 2022-11-04 | 北京华宇九品科技有限公司 | Document classification method, system, computer equipment and storage medium |
CN115563311A (en) * | 2022-10-21 | 2023-01-03 | 中国能源建设集团广东省电力设计研究院有限公司 | Document marking and knowledge base management method and knowledge base management system |
CN115563311B (en) * | 2022-10-21 | 2023-09-15 | 中国能源建设集团广东省电力设计研究院有限公司 | Document labeling and knowledge base management method and knowledge base management system |
CN115858819A (en) * | 2023-01-29 | 2023-03-28 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | Sample data augmentation method and device |
CN115858819B (en) * | 2023-01-29 | 2023-05-16 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | Sample data amplification method and device |
CN117034917A (en) * | 2023-10-08 | 2023-11-10 | 中国医学科学院医学信息研究所 | English text word segmentation method, device and computer readable medium |
CN117034917B (en) * | 2023-10-08 | 2023-12-22 | 中国医学科学院医学信息研究所 | English text word segmentation method, device and computer readable medium |
CN117095782A (en) * | 2023-10-20 | 2023-11-21 | 上海森亿医疗科技有限公司 | Medical text quick input method, system, terminal and editor |
CN117095782B (en) * | 2023-10-20 | 2024-02-06 | 上海森亿医疗科技有限公司 | Medical text quick input method, system, terminal and editor |
Also Published As
Publication number | Publication date |
---|---|
CN110059185B (en) | 2022-10-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110059185A (en) | Medical document professional vocabulary automatic labeling method | |
CN111241294B (en) | Relationship extraction method of graph convolution network based on dependency analysis and keywords | |
CN109753660B (en) | LSTM-based winning bid web page named entity extraction method | |
CN109871538A (en) | A Chinese electronic health record named entity recognition method | |
CN109508459B (en) | Method for extracting theme and key information from news | |
CN111966917A (en) | Event detection and summarization method based on pre-training language model | |
CN111858940B (en) | Multi-head attention-based legal case similarity calculation method and system | |
CN111222318B (en) | Trigger word recognition method based on double-channel bidirectional LSTM-CRF network | |
CN110532328A (en) | A text concept graph construction method | |
CN111400455A (en) | Relation detection method of question-answering system based on knowledge graph | |
CN112966525B (en) | Law field event extraction method based on pre-training model and convolutional neural network algorithm | |
CN110413768A (en) | An automatic article title generation method | |
CN110750646B (en) | Attribute description extracting method for hotel comment text | |
CN110781290A (en) | Extraction method of structured text abstract of long chapter | |
CN112800184B (en) | Short text comment emotion analysis method based on Target-Aspect-Opinion joint extraction | |
CN113360667B (en) | Biomedical trigger word detection and named entity identification method based on multi-task learning | |
CN114818717A (en) | Chinese named entity recognition method and system fusing vocabulary and syntax information | |
CN112101014A (en) | Chinese chemical industry document word segmentation method based on mixed feature fusion | |
CN115545021A (en) | Clinical term identification method and device based on deep learning | |
CN115935995A (en) | Knowledge graph generation-oriented non-genetic-fabric-domain entity relationship extraction method | |
CN116049394A (en) | Long text similarity comparison method based on graph neural network | |
CN110929518A (en) | Text sequence labeling algorithm using overlapping splitting rule | |
CN112800244B (en) | Method for constructing knowledge graph of traditional Chinese medicine and national medicine | |
CN114265936A (en) | Method for realizing text mining of science and technology project | |
CN117236338B (en) | Named entity recognition model of dense entity text and training method thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right |
Effective date of registration: 20240408 Address after: Room 1518B, Unit 2, 12th Floor, Huizhi Building, No. 9 Xueqing Road, Haidian District, Beijing, 100080 Patentee after: Beijing Contention Technology Co., Ltd. Country or region after: China Address before: No. 9, 13th Street, Economic and Technological Development Zone, Binhai New Area, Tianjin Patentee before: Tianjin University of Science and Technology Country or region before: China |
|
TR01 | Transfer of patent right |