CN110059185B - Medical document professional vocabulary automatic labeling method - Google Patents
- Publication number: CN110059185B
- Application number: CN201910265223.3A
- Authority
- CN
- China
- Prior art date: 2019-04-03
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
Abstract
The invention relates to an automatic labeling method for professional vocabulary in medical documents, comprising the following steps: performing data preprocessing on an input medical document to obtain a preprocessed medical document text; acquiring and fusing the letter-level feature vector, word-level feature vector and language feature vector of each word to serve as the word's encoding vector; classifying the word labels of the segmented medical document text to obtain a labeled data set; outputting a multidimensional vector for each word as its spatial representation; acquiring an enhanced annotation data set; and training and modeling, finally outputting the labeling result. The invention is reasonably designed: it uses a semi-supervised learning algorithm to label a large amount of unlabeled data, successfully overcoming the shortage of labeled data in the medical industry, effectively increasing the amount of data available to the model, and greatly improving the algorithm's labeling accuracy on keywords and professional vocabulary; it can be widely used in medical literature processing.
Description
Technical Field
The invention belongs to the technical field of machine learning, and particularly relates to an automatic labeling method for professional vocabularies of medical documents.
Background
With the growth of the medical research community, more and more papers are published every year, and there is an increasing need for methods that automatically extract the key ideas from these articles. However, because the fields involved are so varied and annotation resources are extremely limited, work on scientific information extraction remains relatively rare.
Meanwhile, as demand for medical resources and the corresponding number of medical documents and cases grow, researchers and medical staff need to organize patients' past medical data quickly. Professional vocabulary and keywords often help medical staff make rapid judgments from patient cases, but compiling these terms manually takes considerable time, and manpower limits make it impossible to process large numbers of cases and medical records quickly.
In summary, with rising demand for medical resources, automatically labeling professional vocabulary and keywords, so as to speed up the processing of cases and medical data and help medical personnel treat patients better, is a problem that urgently needs to be solved.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides an automatic labeling method for medical document professional vocabulary that uses a semi-supervised learning algorithm to expand the data volume, overcomes the poor model performance caused by the insufficient amount of labeled data in traditional medical texts, and finally improves the accuracy of recognizing professional vocabulary and keywords in the text.
The technical problem to be solved by the invention is realized by adopting the following technical scheme:
an automatic labeling method for medical document professional vocabularies comprises the following steps:
step 1, performing data preprocessing on an input medical document to obtain a preprocessed medical document text;
step 2, modeling the text using biLSTM to obtain the letter-level feature vector of each word;
step 3, modeling a text by using word2vec to obtain a word-level feature vector of the word;
step 4, obtaining language feature vectors of words based on the language characteristics of the text language;
step 5, fusing the letter-level feature vectors, the word-level feature vectors and the language feature vectors of the words obtained in the step 2, the step 3 and the step 4 to obtain coding vectors of the words;
step 6, labeling the words of the medical document text after word segmentation as the following four types of medical entities: disease name, disease symptom, treatment means and drug name, wherein each entity uses IOBES to represent the specific position of the word within the entity, thereby obtaining a labeled data set;
step 7, taking the text obtained in the step 1 and the encoding vector of the word obtained in the step 5 as the input of the biLSTM, and outputting a multi-dimensional vector as the space representation of the word for each word;
step 8, expanding the labeled data set by using a label propagation algorithm to obtain an enhanced labeled data set;
and step 9, taking the multidimensional vector of step 7 as the spatial representation of each word, inputting it together with the enhanced labeled data set obtained in step 8 into a conditional random field for training and modeling, and finally outputting the labeling result.
Further, the specific implementation method of step 1 is as follows: firstly, segmenting input medical documents to form an array, storing each word and punctuation in the text, then removing stop words, finally extracting word stems and word shapes to restore to obtain basic forms of the words, and forming unmarked word arrays.
Further, the specific implementation method of step 2 is as follows: the letter-level features of the preprocessed medical document text are encoded using biLSTM, encoding the first five letters of each word, finally obtaining a letter-level feature vector of length 5d.
Further, the specific implementation method of step 3 is as follows: the word-level features of the preprocessed medical document text are encoded using Google's Word2Vec algorithm, finally obtaining a word-level feature vector of length d for each word.
Further, the specific implementation method of step 4 is as follows: according to the language features of the text language, a manual definition method is adopted to define the following characteristics for the preprocessed medical document text: the first letter case, all lower case words, all upper case words, part of speech and grammatical structure form a length 21 feature vector, each feature being represented by 0 or 1.
Further, the specific implementation method of step 5 is as follows: the letter-level feature vector, the word-level feature vector and the language feature vector are concatenated together to form, for each word, a comprehensive feature vector of length 5d + d + 21.
Further, the labeled data set of step 6 is a combined label including 20 categories.
Further, the specific implementation method of step 7 is as follows: utilizing the combined feature vector formed by the three features obtained in the step 5, and arranging all feature vectors of the whole word array to form a training data matrix, wherein the number of rows of the matrix is the number of words in the word array, and the number of columns of the matrix is 5d + d +21; using biLSTM, the hidden layer through the forward and backward computation processes is passed as input to the linear layer, which projects the dimensions to the tag type space of size 20 and serves as input to the CRF layer.
Further, the specific implementation method of step 8 is as follows: firstly, a graph is constructed based on the feature vectors corresponding to the words, which serve as the nodes of the graph; the similarity between feature vectors defines the distance between words and the edge weight w_uv, and the total number of nodes in the graph equals the sum of the unlabeled data and the labeled data; then, a label propagation algorithm optimizes an objective function minimizing the Kullback-Leibler distance so that the label distributions of neighboring nodes are as similar to each other as possible, and finally the words corresponding to all nodes in the graph are labeled, giving the enhanced data set.
Further, the specific implementation method of step 9 is as follows: taking the multidimensional spatial representations obtained in step 7 as the word vectors, the biLSTM finally outputs a labeling matrix P containing the probability distribution over each label; this distribution is fed into the CRF layer to obtain a labeling sequence y; the score φ(y; x, θ) of sequence y is calculated, then the probability P_θ(y|x) of the labeling sequence y among all labeling sequences; finally, back propagation is used to maximize the objective function log P_θ(y|x), completing the supervised learning, and the CRF model is output as the final result.
The invention has the advantages and positive effects that:
1. The invention divides the keywords in medical literature into four categories: disease name (disease), symptom (symptom), treatment means (treatment-method) and drug name (Drug-name), and labels medical documents or cases with professional vocabulary based on a semi-supervised labeling method, so that medical staff or students can quickly understand the content of a text at extremely low cost in manpower and materials and make better medical decisions or conduct better research.
2. The invention uses a semi-supervised learning algorithm to label a large amount of unlabeled data, successfully overcoming the shortage of labeled data in the medical industry, effectively increasing the amount of data available to the model, greatly improving the algorithm's labeling accuracy on keywords and professional vocabulary, and can be widely used in medical literature processing.
Drawings
FIG. 1 is a process flow diagram of the present invention.
Detailed Description
The embodiments of the present invention will be described in detail with reference to the accompanying drawings.
The design idea of the invention is as follows: medical documents or cases are labeled with professional vocabulary using machine learning algorithms and a semi-supervised labeling method. The invention constructs a three-layer hierarchical neural network to label text: (1) words in the text undergo vectorized feature extraction in three ways: a biLSTM extracts letter-based features, Word2Vec performs word embedding, and further features are extracted from the grammatical structure; (2) a biLSTM extracts and encodes the contextual information surrounding each word within its sentence; (3) the CRF labeling layer jointly models words and labels using a CRF objective function and makes the final label decision.
Based on the above design concept, the automatic labeling method for medical document professional vocabulary of the invention, as shown in fig. 1, comprises the following steps:
step 1: and carrying out data preprocessing on the input medical document to obtain a preprocessed medical document text.
In this step, the input is a medical document and the output is an array of words. The data preprocessing proceeds as follows: segment the medical document into an array storing each word and punctuation mark of the text; remove stop words such as is, but, shall and by; and finally perform stemming and lemmatization to recover the base form of each word. For example, run, ran and runs all yield run after stemming; lemmatization works similarly, reducing any inflected form of a word to a common base form. Preprocessing thus yields an unlabeled word array of base forms.
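The preprocessing pipeline above can be sketched in a few lines; the stop-word list and the `crude_stem` suffix-stripper below are simplified stand-ins for a full stop-word list and a real stemmer/lemmatizer (e.g. Porter stemming plus WordNet lemmatization), not the patent's actual implementation:

```python
import re

# Small illustrative stop-word list; a real pipeline would use a full one.
STOP_WORDS = {"is", "but", "shall", "by", "the", "a", "an", "of", "and"}

def crude_stem(word):
    """Very rough suffix stripping, standing in for stemming/lemmatization."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(document):
    """Segment into words and punctuation, drop stop words, reduce to base forms."""
    tokens = re.findall(r"[A-Za-z]+|[^\sA-Za-z]", document)
    return [crude_stem(t.lower()) for t in tokens if t.lower() not in STOP_WORDS]

print(preprocess("The patient is running a fever."))
# ['patient', 'runn', 'fever', '.']
```

The crude stemmer over-strips ("running" becomes "runn"), which is exactly why real systems pair stemming with dictionary-based lemmatization as the text describes.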
Step 2: the text is modeled using BilSTM, resulting in letter-level feature vectors for words.
The input of the step is a word array after data preprocessing, the output is a characteristic vector based on letter characteristics, and the length is 5d.
The present invention uses a biLSTM to encode letter features, called character-based embedding. The letter features of a word are generated by the hidden-layer vectors of the biLSTM's forward and backward passes; the advantage of building a character-based embedding layer is that it can handle out-of-vocabulary words and formulas, which are common in these data. The generated feature length per position is set to d. The invention adopts the head 5-gram (i.e. the first 5 letters of the word are encoded from left to right; if the word has fewer than 5 letters, the remaining length is zero-padded), so the final feature vector length is 5d.
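The head 5-gram encoding can be illustrated with a plain bidirectional tanh-RNN in NumPy standing in for the trained biLSTM; the weights here are random (untrained) and d = 8 is an arbitrary demo size, so only the shapes, not the values, are meaningful:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # per-position feature size d (arbitrary demo value)

# Random character embeddings and recurrent weights; a real system trains these.
char_emb = rng.normal(size=(128, D))
W_in, W_h = rng.normal(size=(D, D // 2)), rng.normal(size=(D // 2, D // 2))

def rnn_pass(xs):
    """One directional tanh-RNN pass, returning the hidden state at each step."""
    h, out = np.zeros(D // 2), []
    for x in xs:
        h = np.tanh(x @ W_in + h @ W_h)
        out.append(h)
    return out

def char_features(word):
    """Head 5-gram: embed the first 5 letters (zero-padded on the right),
    run forward and backward passes, and concatenate per position."""
    ids = [min(ord(c), 127) for c in word[:5].ljust(5, "\0")]
    xs = [char_emb[i] for i in ids]
    fwd = rnn_pass(xs)
    bwd = rnn_pass(xs[::-1])[::-1]
    return np.concatenate([np.concatenate([f, b]) for f, b in zip(fwd, bwd)])

print(char_features("fever").shape)  # (40,), i.e. 5 * d with d = 8
```

Each of the 5 positions contributes a forward and a backward half of size d/2, so the flattened output has the 5d length described above.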
Step 3: the text is modeled using word2vec, producing word-level feature vectors for words.
In this step, words from a fixed vocabulary (plus an unknown-word token) are mapped to a vector space, initialized with Word2Vec pre-training on different corpus combinations. The words are encoded using Google's Word2Vec algorithm, called word embedding, finally yielding a feature vector of length d for each word.
Step 4: design language feature vectors of words based on the linguistic characteristics of the text language.
In this step, the input is the word array obtained by segmenting the original text only, and the output is a feature vector designed from language characteristics, of length 21. These features are not trained separately but defined manually, and are called feature embedding. The features defined here comprise: whether the first letter is capitalized, whether the word is all lowercase, whether it is all uppercase, its part of speech and its grammatical structure, 21 features in total; the resulting feature vector has length 21, each feature represented by 0 or 1 to indicate whether it is present.
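A sketch of such manually defined binary features follows; the patent does not enumerate all 21 features, so the coarse POS tag set below is an assumption introduced only to illustrate the 0/1 encoding, and the resulting vector is shorter than 21:

```python
# Assumed coarse POS tag set, for illustration only; the patent says only
# that part-of-speech and grammatical-structure indicators fill the remaining slots.
POS_TAGS = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "ADP", "NUM", "OTHER"]

def language_features(word, pos):
    """3 word-shape indicators plus a one-hot POS indicator, each 0 or 1."""
    shape = [
        int(word[:1].isupper()),   # first letter capitalized
        int(word.islower()),       # all lowercase
        int(word.isupper()),       # all uppercase
    ]
    pos_onehot = [int(pos == t) for t in POS_TAGS]
    return shape + pos_onehot

vec = language_features("Fever", "NOUN")
print(len(vec), vec[:3])  # 12 [1, 0, 0]
```

With a richer tag and grammatical-structure inventory, the same construction reaches the 21 binary features the text describes.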
Step 5: fuse the letter-level features, word-level features and language features of the words obtained in steps 2, 3 and 4 to obtain the encoding vectors of the words.
The input of this step is the letter-level feature vector, the word-level feature vector and the language feature vector of each word; the three feature vectors are concatenated to form, for each word, a comprehensive feature vector of length 5d + d + 21.
Step 6: labeling data: and marking words of the segmented electronic medical record text as four types of medical entities (diseases, symptoms, treatment means and medicine names), wherein each type of entity represents the specific position of the word in the entity by IOBES and is marked as 20 types in total to obtain a marked data set.
In order to distinguish the spans of two consecutive key phrases of the same type, the present invention assigns a tag to each word in the sentence, specifying its position and type within the phrase. On the basis of the preprocessed data, each word is labeled with a tag representing the position of the phrase in which it occurs and its category. The phrase position is labeled first, uniformly using IOBES (Inside, Outside, Beginning, End, Singleton) to describe the position of a word within a professional phrase or vocabulary item: I indicates that the word is inside the phrase, B that it begins the phrase, E that it ends the phrase, S that it is a single-word professional term, and O that the word is outside any phrase. In the present invention, these professional words and phrases are labeled with the categories disease name (disease), symptom (symptom), treatment means (treatment-method) and drug name (Drug-name), combined to form a complete label; for example, a three-word symptom phrase is labeled "B-symptom I-symptom E-symptom". The combined tags thus formed have 20 categories in total. Because the training set is very large, only part of the data is labeled, forming the labeled and unlabeled portions of the data set.
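The IOBES position scheme combined with the four categories can be illustrated with a small helper; the `iobes_tags` function and its `(start, end, category)` span format are hypothetical, introduced only for this example:

```python
def iobes_tags(tokens, spans):
    """Assign combined IOBES-category tags. Each span is a (start, end, category)
    token range (end exclusive) for one of the four entity types."""
    tags = ["O"] * len(tokens)
    for start, end, cat in spans:
        if end - start == 1:
            tags[start] = f"S-{cat}"          # single-word term
        else:
            tags[start] = f"B-{cat}"          # phrase beginning
            tags[end - 1] = f"E-{cat}"        # phrase end
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{cat}"          # phrase interior
    return tags

tokens = ["critically", "ill", "patients", "received", "aspirin"]
print(iobes_tags(tokens, [(0, 3, "symptom"), (4, 5, "Drug-name")]))
# ['B-symptom', 'I-symptom', 'E-symptom', 'O', 'S-Drug-name']
```

Because adjacent phrases of the same category get their own B/E boundaries, two consecutive key phrases remain distinguishable, which is the motivation stated above.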
Step 7: take the text preprocessed in step 1 and the word encoding vectors from step 5 as the input of the biLSTM, set the output size to 20, and output a 20-dimensional vector for each word as its spatial representation.
In the step, a combined feature vector formed by the three features obtained in the step 5 is utilized, all feature vectors of the whole word array are arranged to form a training data matrix, the number of rows of the matrix is the number of words in the word array, and the number of columns of the matrix is 5d + d + 21.
Step 8: perform data enhancement on the labeled data set obtained in step 6, expanding it with a label propagation algorithm to obtain the enhanced labeled data set.
The method comprises two parts. The first part constructs a graph based on the feature vectors corresponding to the words, which serve as the nodes of the graph; the similarity between feature vectors defines their distance and the edge weight w_uv, and the total number of nodes equals the sum of the unlabeled and labeled data. The second part uses a label propagation algorithm, which aims to make the label distributions of neighboring nodes as similar to each other as possible by optimizing an objective function minimizing the Kullback-Leibler distance. Finally the words corresponding to all nodes in the graph are labeled, giving the enhanced data set. The specific method is as follows:
The first part constructs the relation graph required by the label propagation algorithm: vertices correspond to the feature vectors of words, and edges carry distances between word features to capture semantic similarity. The total size of the graph equals the amount of labeled data V_l plus the amount of unlabeled data V_u. Nodes are modeled with a set of pre-trained word embeddings (dimension d), the head 5-gram embedding of the first 5 letters of the current word, the word closest to the verb, and one-hot part-of-speech and case indicators (43- and 4-dimensional one-hot vectors). The resulting feature vector of length 5d + d + 43 + 4 is then projected to 100 dimensions using the PCA dimension-reduction algorithm. The invention defines the weight w_uv of the edge between nodes u and v as: w_uv = d_e(u, v) if v ∈ κ(u) or u ∈ κ(v), where κ(u) is the set of k nearest neighbors of u and d_e(u, v) is the Euclidean distance between any two nodes u and v in the graph.
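The graph construction of this first part can be sketched as follows, with random vectors standing in for the PCA-reduced 100-dimensional word features; per the definition above, the weight w_uv is the Euclidean distance d_e(u, v) whenever u is among v's k nearest neighbours or vice versa:

```python
import numpy as np

rng = np.random.default_rng(1)

def knn_graph(features, k=3):
    """Symmetrized k-NN graph: w_uv = d_e(u, v) if v is among u's k nearest
    neighbours or u is among v's; otherwise 0 (no edge)."""
    n = len(features)
    dist = np.linalg.norm(features[:, None] - features[None, :], axis=-1)
    w = np.zeros((n, n))
    for u in range(n):
        neighbours = np.argsort(dist[u])[1 : k + 1]  # index 0 is u itself
        w[u, neighbours] = dist[u, neighbours]
    return np.maximum(w, w.T)  # symmetrize: keep edge if either side chose it

feats = rng.normal(size=(10, 100))  # stand-in for PCA-reduced word features
w = knn_graph(feats)
print(w.shape)
```

Note the patent uses the raw distance as the edge weight; many propagation variants instead use a decreasing similarity (e.g. a Gaussian kernel of the distance), which would be a one-line change here.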
For each node i in the graph, the invention computes the marginal probabilities {q_i} using a forward and backward computation. Let θ_n denote the estimate of the CRF parameters after the n-th iteration; with it, the invention computes the marginal probability over the IOBES labels at each position i of sentence j, for both the labeled and the unlabeled data.
The second part enhances the data: unlabeled data in the data set are annotated with a label propagation algorithm that makes the label distributions of neighboring nodes as similar as possible by minimizing the following Kullback-Leibler distances: i) for all labeled word nodes in the graph, between the empirical label distribution r_u and the predicted label distribution q_u; ii) for every node u and each of its neighbors v, between the distributions q_u and q_v; iii) for all nodes, between the distribution q_u and the CRF marginal probabilities. If a node is not connected to a labeled vertex, the third term regularizes its predicted distribution toward the CRF prediction, ensuring the algorithm performs at least as well as standard self-training.
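A toy version of this propagation step follows; note it uses plain weighted averaging with clamped seed labels rather than the patent's Kullback-Leibler objective and CRF regularizer, so it is only an illustrative stand-in for the actual algorithm:

```python
import numpy as np

def propagate(w, labels, n_labels, iters=50):
    """Simplified label propagation: seed distributions on labeled nodes,
    repeatedly average neighbours' distributions, re-clamping the seeds."""
    n = len(w)
    q = np.full((n, n_labels), 1.0 / n_labels)  # start from uniform
    for u, l in labels.items():
        q[u] = np.eye(n_labels)[l]
    for _ in range(iters):
        deg = w.sum(axis=1, keepdims=True)
        q = np.where(deg > 0, (w @ q) / np.maximum(deg, 1e-12), q)
        for u, l in labels.items():             # clamp labeled nodes
            q[u] = np.eye(n_labels)[l]
    return q.argmax(axis=1)

# Tiny chain graph 0-1-2-3: node 0 labeled 0, node 3 labeled 1.
w = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
print(propagate(w, {0: 0, 3: 1}, 2))  # [0 0 1 1]
```

Each unlabeled node ends up with the label of the nearer seed, which is the behaviour the KL-based objective also encourages between neighbouring nodes.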
Step 9: take the 20-dimensional spatial representations of the words from step 7 as the word vectors, and input the enhanced labeled data set obtained in step 8 into a conditional random field for training and modeling.
Taking the 20-dimensional spatial representations of the words from step 7 as the word vectors, the biLSTM finally outputs a labeling matrix P containing the probability distribution over each label; this distribution is then fed into the CRF layer to obtain a labeling sequence y; the score φ(y; x, θ) of sequence y is calculated, then the probability P_θ(y|x) of the labeling sequence y among all labeling sequences; finally, back propagation is used to maximize the objective function log P_θ(y|x), completing the supervised learning, and the CRF model is also output as the final result. The specific method is as follows:
keyword classification is a task where there is a strong dependency between output labels, e.g., I-distance cannot be followed by B-stream-method, so the present invention does not make independent labeling decisions for each output, but jointly models them using conditional random fields. For the input sentence x = (x 1, x2, x 3., xn), we consider P to be the score matrix of the biLSTM network output. The size of P is n × m, where n is the number of tokens in the sentence and m is the number of different tokens. P t,i The score of the ith label corresponding to the first word to death in the sentence. The invention uses a first order Markov model and defines a transformation matrix T, where T i,j The table scores from label i to label j. The invention also increases y 0 And y n As start and end identifiers. Thus the dimension of T matrix becomesm+2。
Given a possible output y and the neural network parameters θ, the invention defines the score as
φ(y; x, θ) = ∑_{t=1}^{n} (T_{y_{t−1}, y_t} + P_{t, y_t})
The probability of sequence y is obtained by applying softmax on all possible tag sequences:
P_θ(y|x) = exp(φ(y; x, θ)) / ∑_{y′ ∈ Y} exp(φ(y′; x, θ))
where Y represents all possible tag sequences. The normalization term can be computed efficiently using a forward algorithm.
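The softmax over all tag sequences and the forward-algorithm normalization can be checked numerically on small random scores; `log_partition` and `seq_score` are illustrative names, and the start/end transition terms are omitted for brevity:

```python
import itertools

import numpy as np

def seq_score(P, T, y):
    """phi(y; x, theta): emission scores plus transition scores for sequence y."""
    y = np.asarray(y)
    return P[np.arange(len(y)), y].sum() + T[y[:-1], y[1:]].sum()

def log_partition(P, T):
    """Forward algorithm: log of the sum of exp(score) over all tag sequences."""
    alpha = P[0].copy()
    for t in range(1, len(P)):
        scores = alpha[:, None] + T + P[t][None, :]
        m = scores.max(axis=0)
        alpha = m + np.log(np.exp(scores - m).sum(axis=0))  # logsumexp per tag
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())

rng = np.random.default_rng(2)
P, T = rng.normal(size=(4, 3)), rng.normal(size=(3, 3))  # 4 words, 3 tags
logZ = log_partition(P, T)
total = sum(np.exp(seq_score(P, T, y) - logZ)
            for y in itertools.product(range(3), repeat=4))
print(round(total, 6))  # 1.0: the probabilities of all sequences sum to 1
```

The brute-force sum over all 3^4 sequences matching the forward-algorithm normalizer is exactly the efficiency claim made in the text: the forward pass computes in O(n·m²) what enumeration computes in O(mⁿ).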
Finally, the labeled data in the data set are used for preliminary training, during which the invention maximizes the log-probability L(Y; X, θ) of the correct label sequence for the given corpus {X, Y}, with back propagation based on a gradient computed from the total score of the sentence.
After the trained CRF algorithm is obtained, it is combined with the feature extraction part constructed above, and the text can then be labeled: inputting a sentence x = (x_1, x_2, x_3, ..., x_n) yields a labeled sequence y = (y_1, y_2, y_3, ..., y_n).
nothing in this specification is said to apply to the prior art.
It should be emphasized that the embodiments described herein are illustrative rather than restrictive, and thus the present invention is not limited to the embodiments described in the detailed description, but also includes other embodiments that can be derived from the technical solutions of the present invention by those skilled in the art.
Claims (9)
1. A medical document professional vocabulary automatic labeling method is characterized by comprising the following steps:
step 1, performing data preprocessing on an input medical document to obtain a preprocessed medical document text;
step 2, modeling the text using biLSTM to obtain the letter-level feature vector of each word;
step 3, modeling a text by using word2vec to obtain a word-level feature vector of the word;
step 4, obtaining language feature vectors of words based on the language characteristics of the text language;
step 5, fusing the letter-level feature vectors, the word-level feature vectors and the language feature vectors of the words obtained in the step 2, the step 3 and the step 4 to obtain coding vectors of the words;
and 6, marking the words of the medical document text after word segmentation as the following four medical entities: disease name, disease symptom, treatment means and drug name, wherein each entity uses IOBES to represent the concrete position of the word in the entity to obtain a labeled data set;
step 7, taking the text obtained in the step 1 and the encoding vector of the word obtained in the step 5 as the input of the biLSTM, and outputting a multi-dimensional vector as the space representation of the word for each word;
step 8, expanding the labeled data set by using a label propagation algorithm to obtain an enhanced labeled data set;
and 9, inputting the multidimensional vector obtained in the step 7 and the enhanced labeling data set obtained in the step 8 into a conditional random field for training and modeling, and finally outputting a labeling result.
2. The automatic labeling method for the professional vocabulary of the medical document according to claim 1, wherein: the specific implementation method of the step 1 comprises the following steps: firstly, segmenting the input medical document to form an array, storing each word and punctuation marks in the text, then removing stop words, finally extracting word stems and word shapes to restore to obtain the basic form of the words, and forming the unmarked word array.
3. The automatic labeling method for the professional vocabularies of the medical documents as claimed in claim 1, wherein the method comprises the following steps: the specific implementation method of the step 2 comprises the following steps: the letter-level features of the preprocessed medical document text are encoded using biLSTM, encoding the first five letters of each word, finally obtaining a letter-level feature vector of length 5d.
4. The automatic labeling method for the professional vocabularies of the medical documents as claimed in claim 1, wherein the method comprises the following steps: the specific implementation method of the step 3 comprises the following steps: the word-level features of the preprocessed medical document text are encoded using Google's Word2Vec algorithm, finally obtaining a word-level feature vector of length d for each word.
5. The automatic labeling method for the professional vocabularies of the medical documents as claimed in claim 1, wherein the method comprises the following steps: the specific implementation method of the step 4 comprises the following steps: according to the language features of the text language, a manual definition method is adopted to define the following characteristics for the preprocessed medical document text: the first letter case, all lower case words, all upper case words, part of speech and grammatical structure form a length 21 feature vector, each feature being represented by 0 or 1.
6. The automatic labeling method for the professional vocabularies of the medical documents as claimed in claim 1, wherein the method comprises the following steps: the specific implementation method of the step 5 is as follows: the letter-level feature vector, the word-level feature vector and the language feature vector are concatenated together to form, for each word, a comprehensive feature vector of length 5d + d + 21.
7. The automatic labeling method for the professional vocabularies of the medical documents as claimed in claim 1, wherein the method comprises the following steps: the labeled data set of step 6 is a combined label comprising 20 categories.
8. The automatic labeling method for the professional vocabularies of the medical documents as claimed in claim 1, wherein the method comprises the following steps: the specific implementation method of the step 7 is as follows: utilizing the combined feature vector formed by the three features obtained in the step 5, all feature vectors of the whole word array are arranged to form a training data matrix, wherein the number of rows of the matrix is the number of words in the word array, and the number of columns of the matrix is 5d + d + 21; using biLSTM, the hidden layers from the forward and backward computation passes are passed as input to a linear layer, which projects the dimensions to the tag type space of size 20 and serves as input to the CRF layer.
9. The method for automatically labeling professional vocabulary in medical documents according to claim 1, wherein step 8 is specifically implemented as follows: first, a graph is constructed from the feature vectors of the words, with the words as nodes and the edge weight w_uv between nodes defined by the similarity between their feature vectors; the total number of nodes in the graph equals the sum of the unlabeled and labeled data. Then, a label propagation algorithm optimizes an objective that minimizes the Kullback-Leibler divergence, so that the label distributions of adjacent nodes become as similar as possible; finally, the words corresponding to all nodes in the graph are labeled, yielding an enhanced data set.
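A toy sketch of the propagation idea in step 8 (assumed structure, not the patent's exact algorithm): words are graph nodes, edge weights w_uv come from feature-vector similarity, and each unlabeled node repeatedly takes the weighted average of its neighbors' label distributions. The patent optimizes a Kullback-Leibler objective; plain weighted averaging is used here only to show how labels spread from labeled to unlabeled nodes:

```python
# Label propagation toy example: labeled nodes 0 and 1 stay clamped, and the
# unlabeled nodes 2 and 3 absorb their neighbors' label distributions through
# similarity-weighted averaging until the distributions stabilize.
dist = {
    0: [1.0, 0.0],   # labeled node, class 0
    1: [0.0, 1.0],   # labeled node, class 1
    2: [0.5, 0.5],   # unlabeled
    3: [0.5, 0.5],   # unlabeled
}
adj = {              # w_uv: similarity-based edge weights (hypothetical)
    2: {0: 0.9, 1: 0.1, 3: 0.3},
    3: {0: 0.2, 1: 0.8, 2: 0.3},
}
for _ in range(50):                       # iterate until practically stable
    for u in (2, 3):                      # labeled nodes are never updated
        total = sum(adj[u].values())
        dist[u] = [sum(wt * dist[v][k] for v, wt in adj[u].items()) / total
                   for k in range(2)]
print([round(p, 2) for p in dist[2]])     # [0.77, 0.23] -> node 2 gets class 0
```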
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910265223.3A CN110059185B (en) | 2019-04-03 | 2019-04-03 | Medical document professional vocabulary automatic labeling method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910265223.3A CN110059185B (en) | 2019-04-03 | 2019-04-03 | Medical document professional vocabulary automatic labeling method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110059185A CN110059185A (en) | 2019-07-26 |
CN110059185B true CN110059185B (en) | 2022-10-04 |
Family
ID=67318293
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910265223.3A Active CN110059185B (en) | 2019-04-03 | 2019-04-03 | Medical document professional vocabulary automatic labeling method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110059185B (en) |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111063446B (en) * | 2019-12-17 | 2023-06-16 | 医渡云(北京)技术有限公司 | Method, apparatus, device and storage medium for standardizing medical text data |
WO2021142534A1 (en) * | 2020-01-13 | 2021-07-22 | Knowtions Research Inc. | Method and system of using hierarchical vectorisation for representation of healthcare data |
CN111666406B (en) * | 2020-04-13 | 2023-03-31 | 天津科技大学 | Short text classification prediction method based on word and label combination of self-attention |
CN111651991B (en) * | 2020-04-15 | 2022-08-26 | 天津科技大学 | Medical named entity identification method utilizing multi-model fusion strategy |
CN111797612A (en) * | 2020-05-15 | 2020-10-20 | 中国科学院软件研究所 | Method for extracting automatic data function items |
CN111797263A (en) * | 2020-07-08 | 2020-10-20 | 北京字节跳动网络技术有限公司 | Image label generation method, device, equipment and computer readable medium |
CN112101014B (en) * | 2020-08-20 | 2022-07-26 | 淮阴工学院 | Chinese chemical industry document word segmentation method based on mixed feature fusion |
CN113808752A (en) * | 2020-12-04 | 2021-12-17 | 四川医枢科技股份有限公司 | Medical document identification method, device and equipment |
CN113297852B (en) * | 2021-07-26 | 2021-11-12 | 北京惠每云科技有限公司 | Medical entity word recognition method and device |
CN114386424B (en) * | 2022-03-24 | 2022-06-10 | 上海帜讯信息技术股份有限公司 | Method, apparatus, terminal, and storage medium for automatic labeling of industry professional text |
CN115563311B (en) * | 2022-10-21 | 2023-09-15 | 中国能源建设集团广东省电力设计研究院有限公司 | Document labeling and knowledge base management method and knowledge base management system |
CN115858819B (en) * | 2023-01-29 | 2023-05-16 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | Sample data amplification method and device |
CN117034917B (en) * | 2023-10-08 | 2023-12-22 | 中国医学科学院医学信息研究所 | English text word segmentation method, device and computer readable medium |
CN117095782B (en) * | 2023-10-20 | 2024-02-06 | 上海森亿医疗科技有限公司 | Medical text quick input method, system, terminal and editor |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8745093B1 (en) * | 2000-09-28 | 2014-06-03 | Intel Corporation | Method and apparatus for extracting entity names and their relations |
CN108491382A (en) * | 2018-03-14 | 2018-09-04 | Sichuan University | Semi-supervised biomedical text semantic disambiguation method |
CN108664473A (en) * | 2018-05-11 | 2018-10-16 | Ping An Technology (Shenzhen) Co., Ltd. | Method for recognizing key information in text, electronic device, and readable storage medium |
CN108829801A (en) * | 2018-06-06 | 2018-11-16 | Dalian University of Technology | Event trigger word extraction method based on a document-level attention mechanism |
CN108831559A (en) * | 2018-06-20 | 2018-11-16 | Tsinghua University | Chinese electronic health record text analysis method and system |
CN109299262A (en) * | 2018-10-09 | 2019-02-01 | Sun Yat-sen University | Text entailment relation recognition method fusing multi-granularity information |
Non-Patent Citations (2)
Title |
---|
A semi-supervised automatic semantic labeling method for 3D models; Shang Fuhua et al.; Computer Engineering and Applications; 2011-11-14 (Issue 06); full text *
Research on named entity recognition based on fine-grained word representations; Lin Guanghe et al.; Journal of Chinese Information Processing; 2018-11-15 (Issue 11); full text *
Also Published As
Publication number | Publication date |
---|---|
CN110059185A (en) | 2019-07-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110059185B (en) | Medical document professional vocabulary automatic labeling method | |
CN110008469B (en) | Multilevel named entity recognition method | |
WO2021139424A1 (en) | Text content quality evaluation method, apparatus and device, and storage medium | |
CN111241294B (en) | Relationship extraction method of graph convolution network based on dependency analysis and keywords | |
CN109753660B (en) | LSTM-based winning bid web page named entity extraction method | |
CN111966917A (en) | Event detection and summarization method based on pre-training language model | |
CN112784051A (en) | Patent term extraction method | |
CN111027595A (en) | Double-stage semantic word vector generation method | |
CN112966525B (en) | Law field event extraction method based on pre-training model and convolutional neural network algorithm | |
CN110781290A (en) | Extraction method of structured text abstract of long chapter | |
CN110750646B (en) | Attribute description extracting method for hotel comment text | |
CN112800184B (en) | Short text comment emotion analysis method based on Target-Aspect-Opinion joint extraction | |
CN115687626A (en) | Legal document classification method based on prompt learning fusion key words | |
CN114386417A (en) | Chinese nested named entity recognition method integrated with word boundary information | |
CN115310448A (en) | Chinese named entity recognition method based on combining BERT and word vectors | |
CN112101014A (en) | Chinese chemical industry document word segmentation method based on mixed feature fusion | |
CN117010387A (en) | Roberta-BiLSTM-CRF voice dialogue text naming entity recognition system integrating attention mechanism | |
CN115238693A (en) | Chinese named entity recognition method based on multi-word segmentation and multi-layer bidirectional long-short term memory | |
CN112989830B (en) | Named entity identification method based on multiple features and machine learning | |
CN113191150B (en) | Multi-feature fusion Chinese medical text named entity identification method | |
CN117236338B (en) | Named entity recognition model of dense entity text and training method thereof | |
Venkataramana et al. | Abstractive text summarization using bart | |
CN115809666B (en) | Named entity recognition method integrating dictionary information and attention mechanism | |
CN115358227A (en) | Open domain relation joint extraction method and system based on phrase enhancement | |
CN114661912A (en) | Knowledge graph construction method, device and equipment based on unsupervised syntactic analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right |
Effective date of registration: 2024-04-08
Address after: Room 1518B, Unit 2, 12th Floor, Huizhi Building, No. 9 Xueqing Road, Haidian District, Beijing, 100080
Patentee after: Beijing contention Technology Co.,Ltd.
Country or region after: China
Address before: No. 9, 13th Street, Economic and Technological Development Zone, Binhai New Area, Tianjin
Patentee before: TIANJIN UNIVERSITY OF SCIENCE AND TECHNOLOGY
Country or region before: China