CN110059185A - Medical document professional vocabulary automatic labeling method - Google Patents

Medical document professional vocabulary automatic labeling method

Info

Publication number
CN110059185A
CN110059185A
Authority
CN
China
Prior art keywords
word
medical document
vector
feature vector
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910265223.3A
Other languages
Chinese (zh)
Other versions
CN110059185B (en)
Inventor
王嫄
高铭
王栋
赵婷婷
赵青
陈亚瑞
史艳翠
孔娜
王洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Contention Technology Co ltd
Original Assignee
Tianjin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University of Science and Technology filed Critical Tianjin University of Science and Technology
Priority to CN201910265223.3A priority Critical patent/CN110059185B/en
Publication of CN110059185A publication Critical patent/CN110059185A/en
Application granted granted Critical
Publication of CN110059185B publication Critical patent/CN110059185B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to an automatic labeling method for professional vocabulary in medical documents, comprising: preprocessing an input medical document to obtain preprocessed medical document text; obtaining the letter-level feature vector, word-level feature vector and linguistic feature vector of each word and fusing them into the word's encoding vector; labeling the words of the segmented medical document text by class to obtain a labeled data set; outputting for each word a multi-dimensional vector as its spatial representation; obtaining an augmented labeled data set; and training a model that outputs the final labeling results. The invention is rationally designed: it uses a semi-supervised learning algorithm to label large amounts of unlabeled data, successfully overcoming the shortage of labeled data in the medical industry, effectively increasing the amount of data the model can use, and substantially improving the labeling accuracy of the algorithm for keywords and professional vocabulary. It can be widely applied to medical literature processing.

Description

Medical document professional vocabulary automatic labeling method
Technical field
The invention belongs to the technical field of machine learning, and in particular relates to an automatic labeling method for professional vocabulary in medical documents.
Background technique
With the development of the medical research community, more and more papers are published every year. People increasingly need better ways to find relevant papers and to automatically understand their key ideas. However, because of the variety of domains and the extremely limited annotation resources, relatively little extraction of scientific information has been done.
Meanwhile, as the demand for medical resources grows rapidly, the number of medical documents and cases grows accordingly, so researchers and medical staff need to organize patients' past medical data quickly. The professional vocabulary and keywords in patient cases can often help medical staff make judgments, but sorting out these terms and keywords manually takes a great deal of time; because manpower is limited, it is impossible to finish organizing large numbers of cases and large volumes of medical data quickly.
In summary, as the demand for medical resources rises, how to label professional vocabulary and keywords automatically, so as to speed up the processing of cases and medical data by medical staff and help them treat cases better, is an urgent problem to be solved at present.
Summary of the invention
The purpose of the present invention is to overcome the deficiencies of the prior art and to propose an automatic labeling method for professional vocabulary in medical documents. The method uses a semi-supervised learning algorithm to expand the amount of data, overcomes the poor model performance caused by the previous shortage of labeled medical text data, and ultimately improves the accuracy of identifying professional vocabulary and keywords in text.
The present invention solves its technical problem by adopting the following technical solution:
An automatic labeling method for professional vocabulary in medical documents, comprising the following steps:
Step 1: preprocess the input medical document to obtain preprocessed medical document text;
Step 2: model the text with a biLSTM to obtain the letter-level feature vector of each word;
Step 3: model the text with word2vec to obtain the word-level feature vector of each word;
Step 4: obtain the linguistic feature vector of each word based on linguistic and pragmatic features of the text;
Step 5: fuse the letter-level, word-level and linguistic feature vectors obtained in steps 2, 3 and 4 into the encoding vector of the word;
Step 6: label the words of the segmented medical document text as the following four classes of medical entities: disease name, symptom, treatment method and drug name; for each entity class, IOBES tags indicate the specific position of the word within the entity, giving a labeled data set;
Step 7: take the text obtained in step 1 and the word encoding vectors obtained in step 5 as input to a biLSTM, and output for each word a multi-dimensional vector as its spatial representation;
Step 8: expand the labeled data set with a label propagation algorithm to obtain an augmented labeled data set;
Step 9: using the multi-dimensional vector of step 7 as the spatial representation (vector) of each word, feed the augmented labeled data set obtained in step 8 into a conditional random field for training and modeling, and output the final labeling results.
Further, step 1 is implemented as follows: first segment the input medical document into tokens, forming an array that stores every word and punctuation mark in the text; then remove stop words; finally perform stemming and lemmatization to obtain the basic form of each word, constituting an unlabeled word array.
Further, step 2 is implemented as follows: encode the letter-level features of the preprocessed medical document text with a biLSTM, using the first five letters of each word, and finally obtain a letter-level feature vector of length 5d.
Further, step 3 is implemented as follows: encode the word-level features of the preprocessed medical document text with Google's Word2Vec algorithm, finally obtaining a word-level feature vector of length d for each word.
Further, step 4 is implemented as follows: based on linguistic and pragmatic features of the text, manually define the following features for the preprocessed medical document text: initial capitalization, all-lowercase word, all-uppercase word, part of speech and syntactic structure, forming a feature vector of length 21 in which each feature is represented by 0 or 1.
Further, step 5 is implemented as follows: concatenate the letter-level feature vector, the word-level feature vector and the linguistic feature vector to form a combined feature vector of length 5d+d+21 for each word.
Further, the labeled data set of step 6 uses combined tags comprising 20 categories.
Further, step 7 is implemented as follows: use the combined feature vectors formed from the three kinds of features obtained in step 5, and arrange all the feature vectors of the whole word array to form a training data matrix, whose number of rows equals the number of words in the word array and whose number of columns is 5d+d+21; using a biLSTM, pass the hidden layers of the forward and backward passes as input to a linear layer, which projects the dimension to the size of the tag-type space, 20, and serves as the input of the CRF layer.
Further, step 8 is implemented as follows: first, build a graph from the feature vectors corresponding to the words and use them as graph nodes, defining their distances and weights w_uv by the similarity between feature vectors; the total number of nodes in the graph equals the sum of the unlabeled data and the labeled data. Then, use a label propagation algorithm that minimizes an objective function of Kullback-Leibler distances, making the label distributions of adjacent nodes as similar to each other as possible, so that finally the words corresponding to all graph nodes are labeled and an augmented data set is obtained.
Further, step 9 is implemented as follows: using the multi-dimensional spatial representation of each word obtained in step 7 as the word's vector, the biLSTM finally outputs a tag matrix P containing the probability distribution over the labels; P is fed into the CRF layer to obtain a label sequence y; the score φ(y; x, θ) of sequence y is computed, then the probability P_θ(y|x) that label sequence y occurs among all label sequences, and finally back-propagation is used to maximize the log-likelihood objective, thereby completing supervised learning, and the CRF model is output as the final result.
The advantages and positive effects of the present invention are as follows:
1. The present invention divides the keywords in medical literature into four categories: disease name (disease), symptom (symptom), treatment method (treatment-method) and drug name (drug-name), and labels the professional vocabulary in medical documents or cases based on a semi-supervised labeling method. With very low consumption of manpower and material resources, it helps medical staff and researchers understand the content of a text quickly and make better medical decisions or conduct better research.
2. The present invention uses a semi-supervised learning algorithm to label large amounts of unlabeled data, which successfully overcomes the shortage of labeled data in the medical industry, effectively increases the amount of data the model can use, and substantially improves the labeling accuracy of the algorithm for keywords and professional vocabulary. It can be widely applied to medical literature processing.
Description of the drawings
Fig. 1 is a process flow diagram of the present invention.
Specific embodiment
The embodiments of the present invention are further described below in conjunction with the accompanying drawing.
Design idea of the invention: machine learning algorithms and techniques are used, and the professional vocabulary in medical documents or cases is labeled based on a semi-supervised labeling method. The invention constructs a three-layer hierarchical neural network to label the text: (1) the words in the text are turned into vectorized features in three ways: a BiLSTM extracts letter-based features, Word2Vec provides word embeddings, and features based on syntactic structure are extracted for each word; (2) a BiLSTM extracts and encodes the contextual information surrounding each word within the same sentence; (3) the CRF labeling layer jointly models words and tags with the CRF objective function and makes the final label decision for each word.
Based on above-mentioned design philosophy, medical files specialized vocabulary of the invention automates mask method, as shown in Figure 1, packet Include following steps:
Step 1: preprocess the input medical document to obtain preprocessed medical document text.
In this step the input is a medical document and the output is a word array. The preprocessing method is as follows: the medical document is first segmented into tokens, forming an array that stores each word and punctuation mark in the text; stop words such as "is", "but", "shall" and "by" are then removed; finally stemming and lemmatization are performed to obtain the basic form of each word. For example, "running", "ran" and "runs" all reduce to the word "run" after stemming; lemmatization is essentially similar and reduces any inflected form of a word to its general form. The preprocessing thus yields an unlabeled word array composed of the general forms of the words.
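For illustration, the following is a minimal sketch of the preprocessing described above; NLTK's tokenizer, stop-word list, stemmer and lemmatizer are assumptions standing in for whichever tools an implementation actually uses.

```python
# Minimal preprocessing sketch (assumption: NLTK; the patent does not name a toolkit).
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

for res in ("punkt", "stopwords", "wordnet"):
    nltk.download(res, quiet=True)

def preprocess(document: str) -> list:
    tokens = nltk.word_tokenize(document)                    # split into words and punctuation
    stops = set(stopwords.words("english"))
    tokens = [t for t in tokens if t.lower() not in stops]   # remove stop words ("is", "but", ...)
    stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
    # stemming + lemmatization reduce inflected forms to a basic form ("running" -> "run")
    return [stemmer.stem(lemmatizer.lemmatize(t.lower())) for t in tokens]

print(preprocess("The patient was running a fever and shall be treated."))
```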
Step 2: model the text with a BiLSTM to obtain the letter-level feature vector of each word.
The input of this step is the word array produced by the preprocessing; the output is a feature vector based on letter features, of length 5d.
The invention encodes the letter features with a biLSTM, referred to as Character-Based Embedding. The letter-level features of a word are generated from the hidden-layer vectors of the forward and backward passes of the BiLSTM. The advantage of building a character-based embedding layer is that it can handle out-of-vocabulary words and formulas, which are very common in this data; the length of each generated per-letter feature vector is set to d. The invention extracts features over a 5-gram of the stem (i.e., the first five letters of the word, read from left to right, are encoded; if the word has fewer than five letters, the remaining positions are zero-padded), so the final feature vector length is 5d.
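As a concrete illustration of this character-based embedding layer, the sketch below is an assumption using PyTorch: the value of d, the character-embedding size and the character ids are placeholders, while the bidirectional LSTM over the zero-padded first five letters and the concatenation of the five per-letter outputs into a vector of length 5d follow the description.

```python
# Sketch of the character-based embedding layer (assumption: PyTorch; hidden sizes chosen
# so that the 5 per-letter outputs concatenate to length 5d, as the description implies).
import torch
import torch.nn as nn

class CharEmbedding(nn.Module):
    def __init__(self, n_chars: int, d: int = 50, char_dim: int = 25):
        super().__init__()
        assert d % 2 == 0
        self.embed = nn.Embedding(n_chars, char_dim, padding_idx=0)
        # forward + backward hidden states together give d values per letter position
        self.lstm = nn.LSTM(char_dim, d // 2, bidirectional=True, batch_first=True)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (batch, 5) -- first five letters of each word, zero-padded
        out, _ = self.lstm(self.embed(char_ids))        # (batch, 5, d)
        return out.reshape(out.size(0), -1)             # (batch, 5*d)

ids = torch.tensor([[3, 7, 2, 0, 0]])                   # e.g. "run" padded to 5 (hypothetical ids)
print(CharEmbedding(n_chars=60)(ids).shape)             # torch.Size([1, 250]) with d=50
```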
Step 3: model the text with word2vec to obtain the word-level feature vector of each word.
In this step, the words of a fixed vocabulary (plus an unknown-word marker) are mapped into a vector space. The embeddings are initialized with Word2vec pre-training on a combination of corpora, and the words are encoded with Google's Word2Vec algorithm, referred to as Word Embedding; the resulting feature vector for each word has length d.
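A minimal sketch of this word-level embedding, assuming the gensim (version 4 or later) implementation of Word2Vec as a stand-in for "Google's Word2Vec algorithm"; the toy corpus, the value of d and the zero-vector fallback for unknown words are illustrative assumptions.

```python
# Sketch of the word-level embedding (assumption: gensim >= 4; corpus and d are placeholders).
import numpy as np
from gensim.models import Word2Vec

sentences = [["patient", "run", "fever"], ["drug", "treat", "fever"]]   # toy corpus
d = 50
w2v = Word2Vec(sentences, vector_size=d, window=5, min_count=1, workers=1)

def word_vector(word: str) -> np.ndarray:
    # fixed vocabulary plus an unknown-word fallback (here: a zero vector, an assumption)
    if word in w2v.wv:
        return w2v.wv[word]
    return np.zeros(d, dtype="float32")

print(word_vector("fever").shape)   # (50,)
```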
Step 4: design and obtain the linguistic feature vector of each word based on linguistic and pragmatic features of the text.
In this step the input is the word array obtained by merely segmenting the original text, and the output is a feature vector designed from linguistic features, of length 21. These features are not trained separately but manually defined, referred to as Feature Embedding. The features defined in this part include initial capitalization, all-lowercase word, all-uppercase word, part of speech and syntactic structure, 21 features in total; the resulting feature vector has length 21, and each feature is represented by 0 or 1, indicating whether the word has the corresponding property.
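Since the text enumerates only a few of the 21 features, the sketch below is an assumption: the three capitalization features follow the description, while the remaining slots are filled with placeholder part-of-speech indicators obtained via NLTK.

```python
# Sketch of the hand-defined linguistic features (assumption: the exact 21 features are not
# listed in the text, so the POS slots below are illustrative placeholders).
import numpy as np
import nltk

for res in ("averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(res, quiet=True)

POS_TAGS = ["NN", "NNS", "NNP", "VB", "VBD", "VBG", "VBN", "JJ", "RB",
            "IN", "DT", "CC", "CD", "PRP", "MD", "TO", "WDT", "SYM"]   # 18 placeholder slots

def linguistic_features(word: str) -> np.ndarray:
    feats = np.zeros(21, dtype="float32")
    feats[0] = word[:1].isupper()          # initial capital
    feats[1] = word.islower()              # all lowercase
    feats[2] = word.isupper()              # all uppercase
    pos = nltk.pos_tag([word])[0][1]       # part of speech, one-hot over the placeholder slots
    if pos in POS_TAGS:
        feats[3 + POS_TAGS.index(pos)] = 1.0
    return feats

print(linguistic_features("Fever"))
```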
Step 5: fuse the letter-level, word-level and linguistic features of each word obtained in steps 2, 3 and 4 into the encoding vector of the word.
The input of this step consists of the character-level feature vector, the word-level feature vector and the linguistic feature vector of each word; the three feature vectors are concatenated to form a combined feature vector of length 5d+d+21 for each word.
Step 6: label the data: the words of the segmented electronic medical record text are labeled as four classes of medical entities (disease, symptom, treatment method, drug name); for each entity class, IOBES tags indicate the specific position of a word within the entity, giving 20 classes in total and a labeled data set.
In order to distinguish the spans of two consecutive key phrases of the same type, the invention assigns each word in a sentence a label that specifies its position and type within the phrase. On the basis of the preprocessed data, each word is given a corresponding label indicating the position and category of the phrase it belongs to. The invention uniformly uses IOBES (Inside, Outside, Beginning, End and Singleton) to describe the position of a word within a professional phrase or term: I indicates that the word is inside a phrase, B that it begins a phrase, E that it ends a phrase, S that the word is a single-word professional term on its own, and O that the word lies outside any phrase and is merely contained in the sentence. The invention combines this position with the category of the professional vocabulary or phrase, namely disease name (disease), symptom (symptom), treatment method (treatment-method) and drug name (drug-name), to form a complete label; for example, "critically ill patients" is labeled "B-symptom I-symptom E-symptom". The combined tags formed in this way comprise 20 categories in total. Since the training set is very large, the invention labels only a portion of the data, so the data set consists of labeled data and unlabeled data.
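One way to enumerate the 20 combined IOBES tags (an assumption about how the count of 20 is reached, since the text does not list the classes explicitly) and to encode the example phrase above:

```python
# Sketch of the combined IOBES tag set (assumption: five position markers per entity type,
# which is one enumeration that yields the 20 combined classes stated in the text).
ENTITY_TYPES = ["disease", "symptom", "treatment-method", "drug-name"]
POSITIONS = ["B", "I", "E", "S", "O"]          # Begin, Inside, End, Singleton, Outside

TAGS = [f"{p}-{t}" for t in ENTITY_TYPES for p in POSITIONS]
print(len(TAGS))                                # 20 combined labels
TAG2ID = {tag: i for i, tag in enumerate(TAGS)}

# Example from the description: "critically ill patients" labeled as a symptom phrase
example = list(zip(["critically", "ill", "patients"],
                   ["B-symptom", "I-symptom", "E-symptom"]))
print([(w, TAG2ID[t]) for w, t in example])
```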
Step 7: take the preprocessed text of step 1 and the word encoding vectors of step 5 as the input of a biLSTM, set the output size to 20, and output for each word a 20-dimensional vector as its spatial representation.
In this step, the combined feature vectors formed from the three kinds of features obtained in step 5 are used, and all the feature vectors of the whole word array are arranged to form a training data matrix; the number of rows of the matrix is the number of words in the word array and the number of columns is 5d+d+21. The invention uses a biLSTM and passes the hidden layers of the forward and backward passes as input to a linear layer, which projects the dimension to the size of the tag-type space, 20, and serves as the input of the CRF layer.
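A sketch of this sentence-level encoder, assuming PyTorch; the hidden size is a placeholder, while the input width 5d+d+21 and the projection to 20 tag scores follow the description.

```python
# Sketch of the sentence-level biLSTM + linear projection (assumption: PyTorch).
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    def __init__(self, d: int = 50, n_tags: int = 20, hidden: int = 100):
        super().__init__()
        in_dim = 5 * d + d + 21                       # concatenated word encoding
        self.lstm = nn.LSTM(in_dim, hidden, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, n_tags)     # project to the 20 tag scores

    def forward(self, word_encodings: torch.Tensor) -> torch.Tensor:
        # word_encodings: (batch, sentence_len, 5d+d+21)
        h, _ = self.lstm(word_encodings)              # forward + backward hidden states
        return self.proj(h)                           # (batch, sentence_len, 20) -> CRF input

x = torch.randn(1, 7, 5 * 50 + 50 + 21)               # one sentence of 7 words
print(SentenceEncoder()(x).shape)                     # torch.Size([1, 7, 20])
```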
Step 8: perform data augmentation on the labeled data set obtained in step 6, expanding it with a label propagation algorithm to obtain an augmented labeled data set.
This step has two parts. The first part builds a graph from the feature vectors corresponding to the words, which serve as the graph nodes; their distances and weights w_uv are defined by the similarity between the feature vectors, and the total number of nodes in the graph equals the sum of the unlabeled data and the labeled data. The second part applies a label propagation algorithm that minimizes an objective function of Kullback-Leibler distances, making the label distributions of adjacent nodes as similar to each other as possible; it labels the words corresponding to all graph nodes and yields the augmented data set. The specific method is as follows:
The first part constructs the relational graph required by the label propagation algorithm. The vertices of the graph correspond to the feature vectors of the words, and the edges capture semantic similarity as distances between word features. The total size of the graph equals the amount of labeled data V_l plus the amount of unlabeled data V_u. Each node is modeled with a set of pre-trained word embeddings (dimension d), a 5-gram embedding of the first five letters of the current word, the embedding of the nearest verb, and a group of part-of-speech and capitalization features (one-hot vectors of 43 and 4 dimensions). The resulting feature vector of length 5d+d+43+4 is then projected to 100 dimensions with the PCA dimensionality-reduction algorithm. The invention defines the weight w_uv of the edge between nodes u and v as w_uv = d_e(u, v) if v ∈ κ(u) or u ∈ κ(v), where κ(u) is the set of the k nearest neighbours of u and d_e(u, v) is the Euclidean distance between nodes u and v in the graph.
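A sketch of this graph construction, assuming scikit-learn's PCA and nearest-neighbour search as stand-ins for the unspecified implementation; k and the toy feature matrix are placeholders.

```python
# Sketch of the graph construction for label propagation (assumption: scikit-learn).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

def build_graph(node_features: np.ndarray, k: int = 5):
    """node_features: (V_l + V_u, 5d+d+43+4) features for all labeled and unlabeled nodes."""
    reduced = PCA(n_components=100).fit_transform(node_features)   # project to 100 dimensions
    nn = NearestNeighbors(n_neighbors=k + 1).fit(reduced)
    dist, idx = nn.kneighbors(reduced)                             # Euclidean distances by default
    edges = {}
    for u in range(len(reduced)):
        for d_uv, v in zip(dist[u][1:], idx[u][1:]):               # skip the self-neighbour
            # w_uv = d_e(u, v) if v in kNN(u) or u in kNN(v)
            edges[(u, v)] = edges[(v, u)] = d_uv
    return edges

features = np.random.rand(300, 5 * 50 + 50 + 43 + 4)               # toy node features
print(len(build_graph(features)) // 2, "undirected edges")
```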
For each node i in the graph, the invention computes marginal probabilities {q_i} using the forward and backward passes. Let θ_n denote the estimate of the CRF parameters after the n-th iteration; for each labeled position i of sentence j, in both the labeled and the unlabeled data, the invention computes the marginal probability of its IOBES label, p(y_i | x^(j); θ_n).
The second part augments the data with the label propagation algorithm, labeling the unlabeled data in the data set. The algorithm minimizes an objective function composed of Kullback-Leibler distances so that the label distributions of adjacent nodes are as similar to each other as possible; specifically, it minimizes the following Kullback-Leibler terms: i) for every labeled word node in the graph, the distance between the empirical label distribution r_u and the predicted label distribution q_u; ii) for every node u and each of its adjacent nodes v, the distance between the distributions q_u and q_v; iii) for all nodes, the distance between q_u and the CRF marginal probabilities. If a node is not connected to any labeled vertex, the third term regularizes the predicted distribution towards the CRF prediction, ensuring that the algorithm performs at least as well as standard self-training.
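The text specifies the KL-based objective but not the update rule, so the following is only one simple fixed-point-style propagation consistent with its three terms (seed agreement, neighbour agreement, CRF-marginal regularization), not necessarily the optimizer actually used; the mixing coefficients and the toy graph are assumptions.

```python
# Sketch of a label-propagation update over the word graph (assumption: illustrative scheme,
# mixing seed labels (term i), neighbours (term ii) and CRF marginals (term iii)).
import numpy as np

def propagate(edges, seeds, crf_marginals, n_nodes, n_tags=20, iters=30,
              alpha=1.0, beta=0.5, gamma=0.1):
    q = np.full((n_nodes, n_tags), 1.0 / n_tags)          # start from uniform distributions
    for _ in range(iters):
        new_q = np.zeros_like(q)
        weight = np.zeros(n_nodes)
        for u in range(n_nodes):
            if u in seeds:                                 # term i: empirical distribution r_u
                new_q[u] += alpha * seeds[u]; weight[u] += alpha
            new_q[u] += gamma * crf_marginals[u]           # term iii: CRF-marginal regularizer
            weight[u] += gamma
        for (u, v), w_uv in edges.items():                 # term ii: agree with neighbours
            new_q[u] += beta * w_uv * q[v]
            weight[u] += beta * w_uv
        q = new_q / weight[:, None]
        q /= q.sum(axis=1, keepdims=True)                  # renormalize to distributions
    return q                                               # label distribution per graph node

# toy usage: labels for unlabeled nodes are the argmax of the propagated distributions
edges = {(0, 1): 1.0, (1, 0): 1.0, (1, 2): 0.5, (2, 1): 0.5}
seeds = {0: np.eye(20)[3]}                                 # one labeled node carrying tag id 3
crf = np.full((6, 20), 1.0 / 20)                           # flat CRF marginals for the sketch
print(propagate(edges, seeds, crf, n_nodes=6)[2].argmax()) # propagated label for node 2
```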
Step 9: using the 20-dimensional spatial representation of each word from step 7 as the word's vector, feed the augmented labeled data set obtained in step 8 into a conditional random field for training and modeling.
With the 20-dimensional spatial representation of each word from step 7 as the word's vector, the biLSTM finally outputs a tag matrix P, which essentially contains the score distribution over the labels for each word; P is then fed into the CRF layer, from which a label sequence y can be obtained. The score φ(y; x, θ) of sequence y is computed, then the probability P_θ(y|x) of the label sequence y among all label sequences, and finally back-propagation is used to maximize the log-likelihood objective L(Y; x, θ), thereby completing supervised learning; the CRF model is also output as the final result. The specific method is as follows:
Keyword labeling is a task with strong dependencies between output labels; for example, I-disease cannot follow B-treatment-method. Therefore the invention does not make an independent label decision for each output, but jointly models the outputs with a conditional random field. For an input sentence x = (x_1, x_2, x_3, ..., x_n), the invention regards P as the score matrix output by the biLSTM network. The size of P is n × m, where n is the number of labeled positions (words) in the sentence and m is the number of distinct labels; P_{t,i} is the score of the i-th label for the t-th word of the sentence. The invention uses a first-order Markov model and defines a transition matrix T, where T_{i,j} is the score of the transition from label i to label j. The invention additionally adds y_0 and y_{n+1} as start and end identifiers, so the dimension of T becomes m+2.
Given a possible output sequence y and the neural network parameters θ, the invention defines the score as

φ(y; x, θ) = Σ_{t=0}^{n} T_{y_t, y_{t+1}} + Σ_{t=1}^{n} P_{t, y_t}
The probability of sequence y is obtained by applying a softmax over all possible label sequences:

P_θ(y|x) = exp(φ(y; x, θ)) / Σ_{y′∈Y} exp(φ(y′; x, θ))

where Y denotes the set of all possible label sequences. The normalization term can be computed efficiently with the forward algorithm.
Finally, initial training is performed with the labeled data in the data set. During training, the invention maximizes the log-probability L(Y; x, θ) of the correct label sequence for the given corpus {X, Y}, and back-propagation is carried out based on the gradient computed from the total score of the sentence.
After the trained CRF algorithm is obtained, it is combined with the previously constructed feature-extraction part, and the text can then be labeled: for an input sentence x = (x_1, x_2, x_3, ..., x_n), a label sequence y = (y_1, y_2, y_3, ..., y_n) is obtained.
The meanings of the parameters in the above formulas are as follows: x is the input sentence and y a label sequence; θ denotes the model parameters; P is the n × m score matrix output by the biLSTM; T is the (m+2)-dimensional label transition matrix; φ(y; x, θ) is the score of sequence y; P_θ(y|x) is its normalized probability; and Y is the set of all possible label sequences.
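To make the score and probability defined above concrete, here is a toy sketch that enumerates all label sequences of a tiny example (an assumption for illustration only; a real implementation computes the normalizer with the forward algorithm and decodes with Viterbi).

```python
# Sketch of the CRF score phi(y; x, theta) and the softmax over all label sequences.
import itertools
import numpy as np

def phi(y, P, T):
    """phi = sum_t T[y_t, y_{t+1}] + sum_t P[t, y_t], with extra start/end states in T."""
    start, end = T.shape[0] - 2, T.shape[0] - 1
    path = [start] + list(y) + [end]
    trans = sum(T[path[t], path[t + 1]] for t in range(len(path) - 1))
    emit = sum(P[t, y_t] for t, y_t in enumerate(y))          # 0-indexed emission scores
    return trans + emit

def sequence_probability(y, P, T):
    n, m = P.shape
    scores = [phi(y_prime, P, T) for y_prime in itertools.product(range(m), repeat=n)]
    return np.exp(phi(y, P, T)) / np.sum(np.exp(scores))      # softmax over all sequences

rng = np.random.default_rng(0)
P = rng.normal(size=(3, 3))        # toy biLSTM scores: 3 words, 3 labels
T = rng.normal(size=(5, 5))        # transitions incl. start/end rows -> (m+2) x (m+2)
print(sequence_probability((0, 2, 1), P, T))
```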
Matters not addressed in the present invention are applicable to the prior art.
It should be emphasized that the embodiments described above are illustrative rather than restrictive; therefore the present invention is not limited to the embodiments described in the detailed description, and other embodiments derived by those skilled in the art from the technical solution of the present invention also belong to the scope of protection of the present invention.

Claims (10)

1. An automatic labeling method for professional vocabulary in medical documents, characterized in that it comprises the following steps:
Step 1: preprocess the input medical document to obtain preprocessed medical document text;
Step 2: model the text with a biLSTM to obtain the letter-level feature vector of each word;
Step 3: model the text with word2vec to obtain the word-level feature vector of each word;
Step 4: obtain the linguistic feature vector of each word based on linguistic and pragmatic features of the text;
Step 5: fuse the letter-level, word-level and linguistic feature vectors obtained in steps 2, 3 and 4 into the encoding vector of the word;
Step 6: label the words of the segmented medical document text as the following four classes of medical entities: disease name, symptom, treatment method and drug name; for each entity class, IOBES tags indicate the specific position of the word within the entity, giving a labeled data set;
Step 7: take the text obtained in step 1 and the word encoding vectors obtained in step 5 as input to a biLSTM, and output for each word a multi-dimensional vector as its spatial representation;
Step 8: expand the labeled data set with a label propagation algorithm to obtain an augmented labeled data set;
Step 9: using the multi-dimensional vector of step 7 as the spatial representation (vector) of each word, feed the augmented labeled data set obtained in step 8 into a conditional random field for training and modeling, and output the final labeling results.
2. The automatic labeling method for professional vocabulary in medical documents according to claim 1, characterized in that step 1 is implemented as follows: first segment the input medical document into tokens, forming an array that stores every word and punctuation mark in the text; then remove stop words; finally perform stemming and lemmatization to obtain the basic form of each word, constituting an unlabeled word array.
3. The automatic labeling method for professional vocabulary in medical documents according to claim 1, characterized in that step 2 is implemented as follows: encode the letter-level features of the preprocessed medical document text with a biLSTM, using the first five letters of each word, and finally obtain a letter-level feature vector of length 5d.
4. The automatic labeling method for professional vocabulary in medical documents according to claim 1, characterized in that step 3 is implemented as follows: encode the word-level features of the preprocessed medical document text with Google's Word2Vec algorithm, finally obtaining a word-level feature vector of length d for each word.
5. The automatic labeling method for professional vocabulary in medical documents according to claim 1, characterized in that step 4 is implemented as follows: based on linguistic and pragmatic features of the text, manually define the following features for the preprocessed medical document text: initial capitalization, all-lowercase word, all-uppercase word, part of speech and syntactic structure, forming a feature vector of length 21 in which each feature is represented by 0 or 1.
6. The automatic labeling method for professional vocabulary in medical documents according to claim 1, characterized in that step 5 is implemented as follows: concatenate the letter-level feature vector, the word-level feature vector and the linguistic feature vector to form a combined feature vector of length 5d+d+21 for each word.
7. The automatic labeling method for professional vocabulary in medical documents according to claim 1, characterized in that the labeled data set of step 6 uses combined tags comprising 20 categories.
8. The automatic labeling method for professional vocabulary in medical documents according to claim 1, characterized in that step 7 is implemented as follows: use the combined feature vectors formed from the three kinds of features obtained in step 5, and arrange all the feature vectors of the whole word array to form a training data matrix, whose number of rows equals the number of words in the word array and whose number of columns is 5d+d+21; using a biLSTM, pass the hidden layers of the forward and backward passes as input to a linear layer, which projects the dimension to the size of the tag-type space, 20, and serves as the input of the CRF layer.
9. The automatic labeling method for professional vocabulary in medical documents according to claim 1, characterized in that step 8 is implemented as follows: first, build a graph from the feature vectors corresponding to the words and use them as graph nodes, defining their distances and weights w_uv by the similarity between feature vectors, the total number of nodes in the graph being equal to the sum of the unlabeled data and the labeled data; then, use a label propagation algorithm that minimizes an objective function of Kullback-Leibler distances, making the label distributions of adjacent nodes as similar to each other as possible, so that the words corresponding to all graph nodes are labeled, obtaining an augmented data set.
10. The automatic labeling method for professional vocabulary in medical documents according to claim 1, characterized in that step 9 is implemented as follows: using the multi-dimensional spatial representation of each word obtained in step 7 as the word's vector, the biLSTM finally outputs a tag matrix P containing the probability distribution over the labels; P is fed into the CRF layer to obtain a label sequence y; the score φ(y; x, θ) of sequence y is computed, then the probability P_θ(y|x) that label sequence y occurs among all label sequences, and finally back-propagation is used to maximize the log-likelihood objective, thereby completing supervised learning, and the CRF model is output as the final result.
CN201910265223.3A 2019-04-03 2019-04-03 Medical document professional vocabulary automatic labeling method Active CN110059185B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910265223.3A CN110059185B (en) 2019-04-03 2019-04-03 Medical document professional vocabulary automatic labeling method

Publications (2)

Publication Number Publication Date
CN110059185A true CN110059185A (en) 2019-07-26
CN110059185B CN110059185B (en) 2022-10-04

Family

ID=67318293

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910265223.3A Active CN110059185B (en) 2019-04-03 2019-04-03 Medical document professional vocabulary automatic labeling method

Country Status (1)

Country Link
CN (1) CN110059185B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8745093B1 (en) * 2000-09-28 2014-06-03 Intel Corporation Method and apparatus for extracting entity names and their relations
CN108491382A (en) * 2018-03-14 2018-09-04 四川大学 A kind of semi-supervised biomedical text semantic disambiguation method
CN108664473A (en) * 2018-05-11 2018-10-16 平安科技(深圳)有限公司 Recognition methods, electronic device and the readable storage medium storing program for executing of text key message
CN108829801A (en) * 2018-06-06 2018-11-16 大连理工大学 A kind of event trigger word abstracting method based on documentation level attention mechanism
CN108831559A (en) * 2018-06-20 2018-11-16 清华大学 A kind of Chinese electronic health record text analyzing method and system
CN109299262A (en) * 2018-10-09 2019-02-01 中山大学 A kind of text implication relation recognition methods for merging more granular informations

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
尚福华 et al.: "A semi-supervised automatic semantic labeling method for 3D models", 《计算机工程与应用》 (Computer Engineering and Applications) *
林广和 et al.: "Research on named entity recognition based on fine-grained word representations", 《中文信息学报》 (Journal of Chinese Information Processing) *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111063446A (en) * 2019-12-17 2020-04-24 医渡云(北京)技术有限公司 Method, apparatus, device and storage medium for standardizing medical text data
TWI797537B (en) * 2020-01-13 2023-04-01 加拿大商知識研究有限公司 Method and system of using hierarchical vectorisation for representation of healthcare data
CN111666406A (en) * 2020-04-13 2020-09-15 天津科技大学 Short text classification prediction method based on word and label combination of self-attention
CN111666406B (en) * 2020-04-13 2023-03-31 天津科技大学 Short text classification prediction method based on word and label combination of self-attention
CN111651991B (en) * 2020-04-15 2022-08-26 天津科技大学 Medical named entity identification method utilizing multi-model fusion strategy
CN111651991A (en) * 2020-04-15 2020-09-11 天津科技大学 Medical named entity identification method utilizing multi-model fusion strategy
CN111797612A (en) * 2020-05-15 2020-10-20 中国科学院软件研究所 Method for extracting automatic data function items
CN111797263A (en) * 2020-07-08 2020-10-20 北京字节跳动网络技术有限公司 Image label generation method, device, equipment and computer readable medium
CN112101014A (en) * 2020-08-20 2020-12-18 淮阴工学院 Chinese chemical industry document word segmentation method based on mixed feature fusion
CN113808752A (en) * 2020-12-04 2021-12-17 四川医枢科技股份有限公司 Medical document identification method, device and equipment
CN113297852B (en) * 2021-07-26 2021-11-12 北京惠每云科技有限公司 Medical entity word recognition method and device
CN113297852A (en) * 2021-07-26 2021-08-24 北京惠每云科技有限公司 Medical entity word recognition method and device
CN114386424B (en) * 2022-03-24 2022-06-10 上海帜讯信息技术股份有限公司 Method, device, terminal and storage medium for automatic labeling of industry professional text
CN114386424A (en) * 2022-03-24 2022-04-22 上海帜讯信息技术股份有限公司 Method, device, terminal and storage medium for automatic labeling of industry professional text
CN115292498A (en) * 2022-08-19 2022-11-04 北京华宇九品科技有限公司 Document classification method, system, computer equipment and storage medium
CN115563311A (en) * 2022-10-21 2023-01-03 中国能源建设集团广东省电力设计研究院有限公司 Document marking and knowledge base management method and knowledge base management system
CN115563311B (en) * 2022-10-21 2023-09-15 中国能源建设集团广东省电力设计研究院有限公司 Document labeling and knowledge base management method and knowledge base management system
CN115858819A (en) * 2023-01-29 2023-03-28 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Sample data augmentation method and device
CN115858819B (en) * 2023-01-29 2023-05-16 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Sample data amplification method and device
CN117034917A (en) * 2023-10-08 2023-11-10 中国医学科学院医学信息研究所 English text word segmentation method, device and computer readable medium
CN117034917B (en) * 2023-10-08 2023-12-22 中国医学科学院医学信息研究所 English text word segmentation method, device and computer readable medium
CN117095782A (en) * 2023-10-20 2023-11-21 上海森亿医疗科技有限公司 Medical text quick input method, system, terminal and editor
CN117095782B (en) * 2023-10-20 2024-02-06 上海森亿医疗科技有限公司 Medical text quick input method, system, terminal and editor

Also Published As

Publication number Publication date
CN110059185B (en) 2022-10-04

Similar Documents

Publication Publication Date Title
CN110059185A (en) A kind of medical files specialized vocabulary automation mask method
CN111241294B (en) Relationship extraction method of graph convolution network based on dependency analysis and keywords
CN109753660B (en) LSTM-based winning bid web page named entity extraction method
CN109871538A A Chinese electronic health record named entity recognition method
CN109508459B (en) Method for extracting theme and key information from news
CN111966917A (en) Event detection and summarization method based on pre-training language model
CN111858940B (en) Multi-head attention-based legal case similarity calculation method and system
CN111222318B (en) Trigger word recognition method based on double-channel bidirectional LSTM-CRF network
CN110532328A A text concept graph construction method
CN111400455A (en) Relation detection method of question-answering system based on knowledge graph
CN112966525B (en) Law field event extraction method based on pre-training model and convolutional neural network algorithm
CN110413768A An automatic article title generation method
CN110750646B (en) Attribute description extracting method for hotel comment text
CN110781290A (en) Extraction method of structured text abstract of long chapter
CN112800184B (en) Short text comment emotion analysis method based on Target-Aspect-Opinion joint extraction
CN113360667B (en) Biomedical trigger word detection and named entity identification method based on multi-task learning
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN112101014A (en) Chinese chemical industry document word segmentation method based on mixed feature fusion
CN115545021A (en) Clinical term identification method and device based on deep learning
CN115935995A (en) Knowledge graph generation-oriented non-genetic-fabric-domain entity relationship extraction method
CN116049394A (en) Long text similarity comparison method based on graph neural network
CN110929518A (en) Text sequence labeling algorithm using overlapping splitting rule
CN112800244B (en) Method for constructing knowledge graph of traditional Chinese medicine and national medicine
CN114265936A (en) Method for realizing text mining of science and technology project
CN117236338B (en) Named entity recognition model of dense entity text and training method thereof

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240408

Address after: Room 1518B, Unit 2, 12th Floor, Huizhi Building, No. 9 Xueqing Road, Haidian District, Beijing, 100080

Patentee after: Beijing contention Technology Co.,Ltd.

Country or region after: China

Address before: No.9, 13th Street, economic and Technological Development Zone, Binhai New Area, Tianjin

Patentee before: TIANJIN University OF SCIENCE AND TECHNOLOGY

Country or region before: China

TR01 Transfer of patent right