CN110059185B - Medical document professional vocabulary automatic labeling method - Google Patents

Medical document professional vocabulary automatic labeling method

Info

Publication number
CN110059185B
Authority
CN
China
Prior art keywords
word
words
medical
following
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910265223.3A
Other languages
Chinese (zh)
Other versions
CN110059185A (en
Inventor
王嫄
高铭
王栋
赵婷婷
赵青
陈亚瑞
史艳翠
孔娜
王洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Contention Technology Co ltd
Original Assignee
Tianjin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University of Science and Technology
Priority to CN201910265223.3A
Publication of CN110059185A
Application granted
Publication of CN110059185B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method for automatically labeling professional vocabulary in medical documents, comprising the following steps: preprocessing an input medical document to obtain a preprocessed medical document text; obtaining a letter-level feature vector, a word-level feature vector and a language feature vector for each word and fusing them into the encoding vector of the word; labeling the words of the segmented medical document text to obtain a labeled data set; outputting, for each word, a multidimensional vector as the spatial representation of the word; obtaining an enhanced labeled data set; and training a model and finally outputting the labeling result. The method uses a semi-supervised learning algorithm to label a large amount of unlabeled data, which overcomes the shortage of labeled data in the medical field, increases the amount of data available to the model, greatly improves the accuracy with which the algorithm labels keywords and professional vocabulary, and can be widely applied to medical literature processing.

Description

Medical document professional vocabulary automatic labeling method
Technical Field
The invention belongs to the technical field of machine learning, and particularly relates to an automatic labeling method for professional vocabularies of medical documents.
Background
With the growth of the medical research community, more and more papers are published every year. There is an increasing need for better methods to find relevant articles and to automatically understand the key ideas in them. However, because the fields involved are so varied and annotation resources are extremely limited, work on scientific information extraction remains relatively scarce.
Meanwhile, as demand for medical resources grows and the number of medical documents and case records increases accordingly, researchers and medical staff need to organize patients' past medical data quickly. Professional vocabulary and keywords help medical staff make judgments from patient cases quickly, but compiling them manually takes a great deal of time, and limited manpower means that large numbers of cases and medical records cannot be organized quickly.
In summary, with rising demand for medical resources, how to automatically label professional vocabulary and keywords, so that medical staff can process cases and medical data faster and treat patients better, is a problem that urgently needs to be solved.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art by providing an automatic labeling method for professional vocabulary in medical documents. The method uses a semi-supervised learning algorithm to expand the amount of data, overcomes the poor model performance caused by the shortage of labeled medical text, and ultimately improves the accuracy of recognizing professional vocabulary and keywords in text.
The invention solves this technical problem with the following technical solution:
An automatic labeling method for professional vocabulary in medical documents comprises the following steps:
Step 1: perform data preprocessing on an input medical document to obtain a preprocessed medical document text;
Step 2: model the text with a biLSTM to obtain letter-level feature vectors of the words;
Step 3: model the text with word2vec to obtain word-level feature vectors of the words;
Step 4: obtain language feature vectors of the words based on the linguistic characteristics of the text language;
Step 5: fuse the letter-level, word-level and language feature vectors obtained in steps 2, 3 and 4 to obtain the encoding vector of each word;
Step 6: label the words of the segmented medical document text as one of four types of medical entity (disease name, disease symptom, treatment means and drug name), using IOBES to indicate the position of each word within its entity, to obtain a labeled data set;
Step 7: take the text obtained in step 1 and the word encoding vectors obtained in step 5 as the input of a biLSTM, which outputs for each word a multidimensional vector as the spatial representation of that word;
Step 8: expand the labeled data set with a label propagation algorithm to obtain an enhanced labeled data set;
Step 9: using the multidimensional vectors from step 7 as the word vectors, input the enhanced labeled data set obtained in step 8 into a conditional random field for training and modeling, and finally output the labeling result.
Further, step 1 is implemented as follows: first, segment the input medical document into an array that stores every word and punctuation mark in the text; then remove stop words; finally, apply stemming and lemmatization to obtain the base form of each word, producing an array of unlabeled words.
Further, step 2 is implemented as follows: encode the letter-level features of the preprocessed medical document text with a biLSTM, using the first five letters of each word, to obtain a letter-level feature vector of length 5d.
Further, step 3 is implemented as follows: encode the word-level features of the preprocessed medical document text with Google's Word2Vec algorithm to obtain a word-level feature vector of length d for each word.
Further, step 4 is implemented as follows: based on the linguistic features of the text language, the following features are defined manually for the preprocessed medical document text: initial-letter case, whether the word is all lower case, whether the word is all upper case, part of speech, and grammatical structure. Together these form a feature vector of length 21, in which each feature is represented by 0 or 1.
Further, step 5 is implemented as follows: the letter-level, word-level and language feature vectors are concatenated into a combined feature vector of length 5d + d + 21 for each word.
Further, the labeled data set of step 6 is a combined label including 20 categories.
Further, step 7 is implemented as follows: the combined feature vectors obtained in step 5 for all words in the word array are stacked into a training data matrix whose number of rows is the number of words in the word array and whose number of columns is 5d + d + 21; a biLSTM processes this matrix, and its forward and backward hidden states are passed to a linear layer, which projects them onto the label space of size 20 and whose output serves as the input of the CRF layer.
Further, step 8 is implemented as follows: first, a graph is constructed whose nodes are the feature vectors corresponding to the words; the similarity between feature vectors defines the distance between nodes and the edge weight $w_{uv}$, and the total number of nodes in the graph equals the number of unlabeled items plus the number of labeled items. Then a label propagation algorithm optimizes an objective function that minimizes the Kullback-Leibler distance, making the label distributions of adjacent nodes as similar as possible, so that finally the words corresponding to all nodes in the graph are labeled, yielding an enhanced data set.
Further, step 9 is implemented as follows: using the multidimensional word representations obtained in step 7 as the word vectors, the biLSTM outputs a labeling matrix P, which contains the probability distribution over labels; P is fed into the CRF layer to obtain a label sequence y; the score $\phi(y; x, \theta)$ of the sequence y is computed, followed by the probability $P_\theta(y \mid x)$ of y among all label sequences; finally, back propagation is used to maximize the log-probability objective function
Figure RE-GDA0002062223120000021
which completes supervised learning, and the trained CRF model is output as the final result.
The invention has the advantages and positive effects that:
1. The invention divides the keywords in medical literature into four categories, disease name (disease), symptom (symptom), treatment means (treatment-method) and drug name (Drug-name), and labels the professional vocabulary in medical documents or cases using a semi-supervised labeling method, so that medical staff or researchers can quickly understand the content of a text with very little manpower and material cost and can make better medical decisions or conduct better research.
2. The invention uses a semi-supervised learning algorithm to label a large amount of unlabeled data, which overcomes the shortage of labeled data in the medical field, increases the amount of data available to the model, greatly improves the accuracy with which the algorithm labels keywords and professional vocabulary, and can be widely applied to medical literature processing.
Drawings
FIG. 1 is a process flow diagram of the present invention.
Detailed Description
The embodiments of the present invention will be described in detail with reference to the accompanying drawings.
The design idea of the invention is to label professional vocabulary in medical documents or cases using machine learning techniques and a semi-supervised labeling method. The invention constructs a three-layer hierarchical neural network to label text: (1) Each word in the text is vectorized in three ways: a biLSTM extracts letter-based features, Word2Vec produces word embeddings, and a further set of features is derived from grammatical structure. (2) A biLSTM extracts and encodes the context surrounding each word within the same sentence. (3) A CRF labeling layer uses a CRF objective function to model the words and their labels jointly and makes the final label decision.
Based on the above design concept, the automatic labeling method for medical document professional vocabulary of the invention, as shown in fig. 1, comprises the following steps:
Step 1: Perform data preprocessing on the input medical document to obtain a preprocessed medical document text.
In this step, the input is a medical document and the output is an array of words. Preprocessing proceeds as follows: the medical document is segmented into an array that stores every word and punctuation mark in the text; stop words such as is, but, shall and by are removed; then stemming and lemmatization are applied to recover the base form of each word. For example, run, ran and runs are all reduced to run; lemmatization works similarly, reducing any inflected form of a word to its general form. The result of preprocessing is an array of unlabeled words in their general forms.
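The preprocessing in step 1 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the stop-word set, the irregular-form table and the suffix rules are tiny hypothetical stand-ins for a real stop-word list, stemmer and lemmatizer, and lowercasing the text is also an assumption.

```python
import re

# Hypothetical, deliberately tiny resources; a real system would use full lists.
STOP_WORDS = {"is", "but", "shall", "by", "the", "a", "of"}
IRREGULAR_FORMS = {"ran": "run"}   # lemmatization lookup (assumed)
SUFFIXES = ("ing", "es", "s")      # crude stemming rules (assumed)


def tokenize(text):
    """Split text into lowercase word tokens and single punctuation marks."""
    return re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text.lower())


def base_form(word):
    """Reduce a word to a base form via lookup, then suffix stripping."""
    if word in IRREGULAR_FORMS:
        return IRREGULAR_FORMS[word]
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize, drop stop words, and normalize each remaining token."""
    return [base_form(tok) for tok in tokenize(text) if tok not in STOP_WORDS]


print(preprocess("The patient runs, but ran."))
```

Punctuation is kept as separate tokens because the text states that the array stores each word and punctuation mark.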
Step 2: Model the text with a biLSTM to obtain letter-level feature vectors of the words.
The input of this step is the word array produced by data preprocessing; the output is a feature vector based on letter features, of length 5d.
The invention uses a biLSTM to encode letter features, which is called character-based embedding. The letter-level features of a word are produced from the hidden-layer vectors of the biLSTM's forward and backward passes. The advantage of a character-based embedding layer is that it can handle out-of-vocabulary words and formulas, which are common in this kind of data. The length of each generated feature vector is set to d. The invention uses the head 5-gram, i.e. the first 5 letters of the word are encoded from left to right, and if the word has fewer than 5 letters the remaining positions are zero-filled, so the final feature vector has length 5d.
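The head 5-gram and its zero padding can be illustrated without a neural network. In the sketch below, `letter_vectors` is a hypothetical lookup table standing in for the per-letter vectors that the biLSTM would produce; only the truncation, padding and concatenation to length 5d follow the text.

```python
def head_5gram(word, pad="\0"):
    """Return the first five letters of a word, padded on the right."""
    letters = list(word[:5])
    return letters + [pad] * (5 - len(letters))


def letter_feature(word, letter_vectors, d):
    """Concatenate one d-dimensional vector per position into a 5d vector."""
    zero = [0.0] * d
    feature = []
    for ch in head_5gram(word):
        # Padding positions (and letters missing from the table) become zeros.
        feature.extend(letter_vectors.get(ch, zero))
    return feature


# Toy table with d = 2; the values are arbitrary placeholders.
toy_vectors = {"f": [1.0, 0.0], "l": [0.0, 1.0], "u": [1.0, 1.0]}
print(letter_feature("flu", toy_vectors, d=2))
```

For the three-letter word "flu" the last two positions are zero-filled, so the result still has 5d = 10 entries.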
Step 3: Model the text with word2vec to obtain word-level feature vectors of the words.
In this step, words from a fixed vocabulary (plus an unknown-word token) are mapped to a vector space, initialized with Word2Vec vectors pre-trained on different combinations of corpora. The words are encoded with Google's Word2Vec algorithm, which is called word embedding, and the resulting feature vector for each word has length d.
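The fixed vocabulary with an unknown-word token can be sketched as a simple lookup. The token name `<unk>`, the function names and the toy vectors are assumptions for illustration; in practice the embedding rows would come from pre-trained Word2Vec vectors.

```python
UNK = "<unk>"  # hypothetical name for the unknown-word token


def build_vocab(words):
    """Assign an index to every distinct word, reserving index 0 for UNK."""
    vocab = {UNK: 0}
    for word in words:
        vocab.setdefault(word, len(vocab))
    return vocab


def lookup(word, vocab, embeddings):
    """Return the d-dimensional vector of a word, or the UNK vector."""
    return embeddings[vocab.get(word, vocab[UNK])]


vocab = build_vocab(["fever", "cough", "fever"])
embeddings = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4]]  # toy rows, d = 2
print(lookup("cough", vocab, embeddings))    # row for a known word
print(lookup("aspirin", vocab, embeddings))  # falls back to the UNK row
```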
Step 4: Design language feature vectors of the words based on the linguistic characteristics of the text language.
In this step, the input is the word array obtained by performing only word segmentation on the original text, and the output is a feature vector of length 21 designed from language features. These features are not trained separately; they are defined manually and are called feature embedding. The features defined here are the initial-letter case, whether the word is all lower case, whether the word is all upper case, the part of speech, and the grammatical structure, for a total of 21 features. The resulting feature vector has length 21, and each entry is 0 or 1 to indicate whether the corresponding feature is present.
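A few of these binary features are easy to show directly. The sketch below covers only the three capitalization indicators plus a one-hot part-of-speech vector over a hypothetical four-tag inventory; the text does not list how the 21 features break down, so this does not reproduce the full vector.

```python
def case_features(token):
    """Binary capitalization indicators for a single token."""
    return [
        int(token[:1].isupper()),  # initial letter is upper case
        int(token.islower()),      # all cased letters are lower case
        int(token.isupper()),      # all cased letters are upper case
    ]


def one_hot(value, categories):
    """0/1 vector with a single 1 at the position of value, if present."""
    return [int(value == c) for c in categories]


# Hypothetical coarse part-of-speech inventory; the source does not list its tags.
POS_TAGS = ["NOUN", "VERB", "ADJ", "OTHER"]


def language_features(token, pos_tag):
    """Concatenate capitalization indicators and the part-of-speech one-hot."""
    return case_features(token) + one_hot(pos_tag, POS_TAGS)


print(language_features("MRI", "NOUN"))
```

The part-of-speech tag is passed in by the caller here; a real pipeline would obtain it from a tagger.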
Step 5: Fuse the letter-level, word-level and language features of the words obtained in steps 2, 3 and 4 to obtain the encoding vector of each word.
The inputs of this step are the letter-level feature vector, the word-level feature vector and the language feature vector of each word; the three are concatenated into a combined feature vector of length 5d + d + 21 for each word.
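The fusion is a plain concatenation, so the length of the result is the sum of the three input lengths: 5d for the letter-level vector, d for the word-level vector and 21 for the language features, which matches the 5d + d + 21 column count given for the step 7 training matrix. A short sketch with an illustrative d = 4 (an assumed value, not one from the source):

```python
def fuse(letter_vec, word_vec, language_vec):
    """Concatenate the three per-word feature vectors into one encoding vector."""
    return list(letter_vec) + list(word_vec) + list(language_vec)


d = 4  # illustrative embedding size (assumed)
encoding = fuse([0.0] * (5 * d), [0.0] * d, [0] * 21)
print(len(encoding))  # 5*4 + 4 + 21 = 45
```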
Step 6: Label the data. The words of the segmented electronic medical record text are labeled as four types of medical entity (disease, symptom, treatment means and drug name); within each entity type, IOBES indicates the position of the word in the entity, giving 20 label classes in total and producing a labeled data set.
To distinguish the spans of two consecutive key phrases of the same type, the invention assigns each word in a sentence a tag specifying its position in the phrase and the phrase type. Starting from the preprocessed data, each word is given a label that represents its position in the phrase and the corresponding category. Position is labeled first, uniformly using IOBES (Inside, Outside, Beginning, End and Singleton) to describe the position of a word within a professional phrase or term: I indicates that the word is inside the phrase, B that it begins the phrase, E that it ends the phrase, S that the word is a single-word professional term, and O that the word is outside any phrase but still part of the sentence. These professional words and phrases are then labeled with a category, one of disease name (disease), symptom (symptom), treatment means (treatment-method) and drug name (Drug-name), and position and category are combined into a complete label; for example, "critically ill patients" is labeled "B-symptom I-symptom E-symptom". The combined labels formed in this way have 20 categories in total. Because the training set is very large, only part of the data is labeled, so the data set consists of labeled and unlabeled data.
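The IOBES scheme can be made concrete with a small converter from entity spans to per-word tags. The span format (start index, exclusive end index, entity type), the function name and the example sentence are assumptions made for this sketch.

```python
def iobes_tags(num_tokens, spans):
    """Build IOBES tags from (start, end, entity_type) spans, end exclusive."""
    tags = ["O"] * num_tokens
    for start, end, entity_type in spans:
        if end - start == 1:
            tags[start] = f"S-{entity_type}"   # single-word entity
            continue
        tags[start] = f"B-{entity_type}"       # first word of the entity
        for i in range(start + 1, end - 1):
            tags[i] = f"I-{entity_type}"       # words strictly inside
        tags[end - 1] = f"E-{entity_type}"     # last word of the entity
    return tags


tokens = ["patient", "has", "type", "2", "diabetes", "and", "fever"]
spans = [(2, 5, "disease"), (6, 7, "symptom")]  # hypothetical annotation
for token, tag in zip(tokens, iobes_tags(len(tokens), spans)):
    print(token, tag)
```

Because every multi-word entity closes with an E tag and single words use S, two adjacent entities of the same type remain distinguishable, which is the motivation given in the text.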
Step 7: Take the text preprocessed in step 1 and the word encoding vectors from step 5 as the input of a biLSTM with output size 20, which outputs for each word a 20-dimensional vector as the spatial representation of that word.
In this step, the combined feature vectors obtained in step 5 for all words in the word array are stacked into a training data matrix; the number of rows of the matrix is the number of words in the word array, and the number of columns is 5d + d + 21.
Step 8: Perform data enhancement on the labeled data set obtained in step 6 by expanding it with a label propagation algorithm to obtain an enhanced labeled data set.
This step has two parts. The first part constructs a graph whose nodes are the feature vectors corresponding to the words; the similarity between feature vectors defines the distance between nodes and the edge weight $w_{uv}$, and the total number of nodes in the graph equals the number of unlabeled items plus the number of labeled items. The second part applies a label propagation algorithm, which makes the label distributions of adjacent nodes as similar as possible by optimizing an objective function that minimizes the Kullback-Leibler distance. Finally, the words corresponding to all nodes in the graph are labeled, yielding the enhanced data set. The details are as follows:
The first part constructs the graph required by the label propagation algorithm. Each vertex corresponds to the feature vector of a word, and each edge carries the distance between word features, which captures semantic similarity. The total size of the graph equals the amount of labeled data $V_l$ plus the amount of unlabeled data $V_u$. Each node is represented with pre-trained word embeddings (dimension d): a 5-gram embedding built from the first 5 letters of the current word, the embedding of the word closest to the verb, and one-hot vectors for the part-of-speech tag and the capitalization (43 and 4 dimensions respectively). The resulting feature vector of length 5d + d + 43 + 4 is then projected to 100 dimensions with the PCA dimension reduction algorithm. The invention defines the weight $w_{uv}$ of the edge between nodes u and v as $w_{uv} = d_e(u, v)$ if $v \in \kappa(u)$ or $u \in \kappa(v)$, where $\kappa(u)$ is the set of k-nearest neighbors of u and $d_e(u, v)$ is the Euclidean distance between any two nodes u and v in the graph.
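The edge rule above (connect u and v when either is among the other's k nearest neighbors, weighted by Euclidean distance) can be sketched in plain Python. The function names are hypothetical, the PCA projection is omitted, and the brute-force neighbor search is only suitable for toy inputs.

```python
import math


def euclidean(a, b):
    """Euclidean distance d_e between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_graph(vectors, k):
    """Return {(u, v): w_uv} with u < v, where w_uv is the Euclidean distance
    and an edge exists if v is among u's k nearest neighbors or vice versa."""
    n = len(vectors)
    neighbors = []
    for u in range(n):
        others = sorted(
            (v for v in range(n) if v != u),
            key=lambda v: euclidean(vectors[u], vectors[v]),
        )
        neighbors.append(set(others[:k]))
    edges = {}
    for u in range(n):
        for v in range(u + 1, n):
            if v in neighbors[u] or u in neighbors[v]:
                edges[(u, v)] = euclidean(vectors[u], vectors[v])
    return edges


points = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]]  # toy node features
print(knn_graph(points, k=1))
```

With k = 1, node 2 is linked to node 1 even though node 1's own nearest neighbor is node 0, which shows the effect of the "or" in the edge condition.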
For each node i in the graph, the invention computes the marginal probabilities $\{q_i\}$ using the forward-backward algorithm. Let $\theta$ denote the estimate of the CRF parameters after the n-th iteration; the invention computes, for labeled and unlabeled data alike, the marginal probability over IOBES labels at every position i of every sentence j:
Figure RE-GDA0002062223120000041
The second part enhances the data by annotating the unlabeled data in the data set with a label propagation algorithm. The algorithm makes the label distributions of adjacent nodes as similar as possible by minimizing the sum of the following Kullback-Leibler distances: i) for the word nodes in the graph, between the distribution $r_u$ of the labeled data and the predicted label distribution $q_u$; ii) for every node u in the graph and each of its neighbors v, between the distributions $q_u$ and $q_v$; iii) for every node u, between the distribution $q_u$ and the CRF marginal probability:
Figure RE-GDA0002062223120000051
If a node is not connected to any labeled vertex, the third term regularizes the predicted distribution toward the CRF prediction, ensuring that the algorithm is at least as good as standard self-training.
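The original formula survives only as an image reference, so the following is a hedged sketch of one objective consistent with the three terms described above, not a transcription of the patent's formula. The set of labeled nodes $\mathcal{L}$, the neighbor set $\mathcal{N}(u)$, the CRF marginal $\tilde{q}_u$, the trade-off weights $\mu$ and $\nu$, and the argument order inside each KL term are all assumptions introduced here.

```latex
C(q) = \sum_{u \in \mathcal{L}} \mathrm{KL}\left(r_u \,\|\, q_u\right)
     + \mu \sum_{u} \sum_{v \in \mathcal{N}(u)} w_{uv}\, \mathrm{KL}\left(q_u \,\|\, q_v\right)
     + \nu \sum_{u} \mathrm{KL}\left(q_u \,\|\, \tilde{q}_u\right),
\qquad
\mathrm{KL}\left(p \,\|\, q\right) = \sum_{y} p(y) \log \frac{p(y)}{q(y)}
```

Under this reading, the first term keeps predictions on labeled nodes close to their annotations, the second smooths predictions along graph edges in proportion to $w_{uv}$, and the third pulls every node toward the CRF marginal, which is what makes the CRF prediction the fallback for nodes with no labeled neighbor.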
Step 9: Using the 20-dimensional word representations from step 7 as the word vectors, input the enhanced labeled data set obtained in step 8 into a conditional random field for training and modeling.
Using the 20-dimensional word representations from step 7 as the word vectors, the biLSTM outputs a labeling matrix P, which essentially contains the probability distribution over labels; P is then fed into the CRF layer to obtain a label sequence y. The method computes the score $\phi(y; x, \theta)$ of the sequence y, then the probability $P_\theta(y \mid x)$ of y among all label sequences, and finally uses back propagation to maximize the objective function
Figure RE-GDA0002062223120000052
which completes supervised learning; the trained CRF model is output as the final result. The details are as follows:
Keyword classification is a task with strong dependencies between output labels; for example, I-disease cannot be followed by B-treatment-method. The invention therefore does not make an independent labeling decision for each output but models the outputs jointly with a conditional random field. For an input sentence $x = (x_1, x_2, x_3, \ldots, x_n)$, let P be the score matrix output by the biLSTM network. P has size $n \times m$, where n is the number of tokens in the sentence and m is the number of distinct labels. $P_{t,i}$ is the score of the i-th label for the t-th word in the sentence. The invention uses a first-order Markov model and defines a transition matrix T, where $T_{i,j}$ is the score of a transition from label i to label j. The invention also adds $y_0$ and $y_n$ as start and end tags, so T becomes a square matrix of size m + 2.
Given a possible output y and a neural network parameter θ, the present invention defines the score as
Figure RE-GDA0002062223120000053
The probability of sequence y is obtained by applying softmax on all possible tag sequences:
$P_\theta(y \mid x) = \exp(\phi(y; x, \theta)) \,/\, \sum_{y' \in Y} \exp(\phi(y'; x, \theta))$
where Y represents all possible tag sequences. The normalization term can be computed efficiently using a forward algorithm.
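The score, the softmax over label sequences and the forward algorithm can be sketched in plain Python. Since the score formula itself survives only as an image reference, the sketch assumes the usual linear-chain form: the sum of transition scores along the path (including the start and end tags) plus the sum of the per-word scores from P. Placing the start and end tags at indices m and m + 1 of T is also an assumption.

```python
import itertools
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def sequence_score(P, T, y):
    """phi(y; x, theta): transition scores along the path plus per-word scores."""
    m = len(P[0])
    start, end = m, m + 1  # assumed positions of the two extra tags in T
    path = [start] + list(y) + [end]
    transitions = sum(T[a][b] for a, b in zip(path, path[1:]))
    emissions = sum(P[t][tag] for t, tag in enumerate(y))
    return transitions + emissions


def log_partition(P, T):
    """Log of the normalization term, computed with the forward algorithm."""
    m = len(P[0])
    start, end = m, m + 1
    alpha = [T[start][j] + P[0][j] for j in range(m)]
    for t in range(1, len(P)):
        alpha = [
            logsumexp([alpha[i] + T[i][j] for i in range(m)]) + P[t][j]
            for j in range(m)
        ]
    return logsumexp([alpha[j] + T[j][end] for j in range(m)])


def log_prob(P, T, y):
    """log P_theta(y | x) = phi(y) - log sum over y' of exp(phi(y'))."""
    return sequence_score(P, T, y) - log_partition(P, T)


# Toy example: n = 3 words, m = 2 labels, T is (m + 2) x (m + 2).
P = [[0.5, 1.0], [0.2, -0.3], [1.5, 0.1]]
T = [[0.1, 0.2, 0.0, 0.3],
     [0.4, -0.1, 0.0, 0.2],
     [0.3, 0.1, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]
total = sum(math.exp(log_prob(P, T, y))
            for y in itertools.product(range(2), repeat=3))
print(round(total, 6))  # probabilities over all label sequences sum to 1
```

The forward recursion costs on the order of n times m squared operations, whereas the brute-force enumeration used for the check grows as m to the power n, which is why the normalization term is computed with the forward algorithm.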
Finally, the labeled data in the data set is used for preliminary training, during which the invention maximizes the log-probability $L(Y; X, \theta)$ of the correct label sequences for a given corpus {X, Y}. Back propagation uses the gradient computed from the total score of the sentence.
Once the trained CRF is obtained, it is combined with the previously constructed feature extraction part, and text can then be labeled: given an input sentence $x = (x_1, x_2, x_3, \ldots, x_n)$, a label sequence $y = (y_1, y_2, y_3, \ldots, y_n)$ is obtained.
The meaning of the parameters in the above formula is illustrated below:
Figure RE-GDA0002062223120000054
Figure RE-GDA0002062223120000061
Matters not described in this specification follow the prior art.
It should be emphasized that the embodiments described herein are illustrative rather than restrictive, and thus the present invention is not limited to the embodiments described in the detailed description, but also includes other embodiments that can be derived from the technical solutions of the present invention by those skilled in the art.

Claims (9)

1. A medical document professional vocabulary automatic labeling method is characterized by comprising the following steps:
step 1, performing data preprocessing on an input medical document to obtain a preprocessed medical document text;
step 2, modeling the text with a biLSTM to obtain letter-level feature vectors of the words;
step 3, modeling the text with word2vec to obtain word-level feature vectors of the words;
step 4, obtaining language feature vectors of the words based on the linguistic characteristics of the text language;
step 5, fusing the letter-level feature vectors, the word-level feature vectors and the language feature vectors of the words obtained in step 2, step 3 and step 4 to obtain the encoding vectors of the words;
step 6, labeling the words of the segmented medical document text as the following four types of medical entity: disease name, disease symptom, treatment means and drug name, wherein IOBES is used within each entity to indicate the position of the word in the entity, to obtain a labeled data set;
step 7, taking the text obtained in step 1 and the encoding vectors of the words obtained in step 5 as the input of a biLSTM, and outputting for each word a multidimensional vector as the spatial representation of the word;
step 8, expanding the labeled data set with a label propagation algorithm to obtain an enhanced labeled data set;
step 9, inputting the multidimensional vectors obtained in step 7 and the enhanced labeled data set obtained in step 8 into a conditional random field for training and modeling, and finally outputting a labeling result.
2. The automatic labeling method for professional vocabulary in medical documents according to claim 1, wherein step 1 is implemented as follows: first, the input medical document is segmented into an array that stores every word and punctuation mark in the text; then stop words are removed; finally, stemming and lemmatization are applied to obtain the base form of each word, forming an array of unlabeled words.
3. The automatic labeling method for professional vocabulary in medical documents according to claim 1, wherein step 2 is implemented as follows: the letter-level features of the preprocessed medical document text are encoded with a biLSTM, using the first five letters of each word, to obtain a letter-level feature vector of length 5d.
4. The automatic labeling method for professional vocabulary in medical documents according to claim 1, wherein step 3 is implemented as follows: the word-level features of the preprocessed medical document text are encoded with Google's Word2Vec algorithm to obtain a word-level feature vector of length d for each word.
5. The automatic labeling method for professional vocabulary in medical documents according to claim 1, wherein step 4 is implemented as follows: based on the linguistic features of the text language, the following features are defined manually for the preprocessed medical document text: initial-letter case, whether the word is all lower case, whether the word is all upper case, part of speech, and grammatical structure, which together form a feature vector of length 21, each feature being represented by 0 or 1.
6. The automatic labeling method for professional vocabulary in medical documents according to claim 1, wherein step 5 is implemented as follows: the letter-level feature vector, the word-level feature vector and the language feature vector are concatenated into a combined feature vector of length 5d + d + 21 for each word.
7. The automatic labeling method for professional vocabulary in medical documents according to claim 1, wherein the labeled data set of step 6 uses combined labels comprising 20 categories.
8. The automatic labeling method for professional vocabulary in medical documents according to claim 1, wherein step 7 is implemented as follows: the combined feature vectors obtained in step 5 for all words in the word array are stacked into a training data matrix whose number of rows is the number of words in the word array and whose number of columns is 5d + d + 21; a biLSTM processes this matrix, and its forward and backward hidden states are passed to a linear layer, which projects them onto the label space of size 20 and whose output serves as the input of the CRF layer.
9. The automatic labeling method for professional vocabulary in medical documents according to claim 1, wherein step 8 is implemented as follows: first, a graph is constructed whose nodes are the feature vectors corresponding to the words, the similarity between feature vectors defining the distance between nodes and the edge weight $w_{uv}$, and the total number of nodes in the graph being equal to the number of unlabeled items plus the number of labeled items; then a label propagation algorithm optimizes an objective function that minimizes the Kullback-Leibler distance, making the label distributions of adjacent nodes as similar as possible, so that finally the words corresponding to all nodes in the graph are labeled, yielding an enhanced data set.
CN201910265223.3A 2019-04-03 2019-04-03 Medical document professional vocabulary automatic labeling method Active CN110059185B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910265223.3A CN110059185B (en) 2019-04-03 2019-04-03 Medical document professional vocabulary automatic labeling method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910265223.3A CN110059185B (en) 2019-04-03 2019-04-03 Medical document professional vocabulary automatic labeling method

Publications (2)

Publication Number Publication Date
CN110059185A (en) 2019-07-26
CN110059185B (en) 2022-10-04

Family

ID=67318293

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910265223.3A Active CN110059185B (en) 2019-04-03 2019-04-03 Medical document professional vocabulary automatic labeling method

Country Status (1)

Country Link
CN (1) CN110059185B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111063446B (en) * 2019-12-17 2023-06-16 Yidu Cloud (Beijing) Technology Co., Ltd. Method, apparatus, device and storage medium for standardizing medical text data
WO2021142534A1 (en) * 2020-01-13 2021-07-22 Knowtions Research Inc. Method and system of using hierarchical vectorisation for representation of healthcare data
CN111666406B (en) * 2020-04-13 2023-03-31 Tianjin University of Science and Technology Short text classification prediction method based on word and label combination of self-attention
CN111651991B (en) * 2020-04-15 2022-08-26 Tianjin University of Science and Technology Medical named entity identification method utilizing multi-model fusion strategy
CN111797612A (en) * 2020-05-15 2020-10-20 Institute of Software, Chinese Academy of Sciences Method for extracting automatic data function items
CN111797263A (en) * 2020-07-08 2020-10-20 Beijing ByteDance Network Technology Co., Ltd. Image label generation method, device, equipment and computer readable medium
CN112101014B (en) * 2020-08-20 2022-07-26 Huaiyin Institute of Technology Chinese chemical industry document word segmentation method based on mixed feature fusion
CN113808752A (en) * 2020-12-04 2021-12-17 Sichuan Yishu Technology Co., Ltd. Medical document identification method, device and equipment
CN113297852B (en) * 2021-07-26 2021-11-12 Beijing Huimei Cloud Technology Co., Ltd. Medical entity word recognition method and device
CN114386424B (en) * 2022-03-24 2022-06-10 Shanghai Zhixun Information Technology Co., Ltd. Industry professional text automatic labeling method, industry professional text automatic labeling device, industry professional text automatic labeling terminal and industry professional text automatic labeling storage medium
CN115563311B (en) * 2022-10-21 2023-09-15 China Energy Engineering Group Guangdong Electric Power Design Institute Co., Ltd. Document labeling and knowledge base management method and knowledge base management system
CN115858819B (en) * 2023-01-29 2023-05-16 Institute of Artificial Intelligence, Hefei Comprehensive National Science Center (Anhui Artificial Intelligence Laboratory) Sample data amplification method and device
CN117034917B (en) * 2023-10-08 2023-12-22 Institute of Medical Information, Chinese Academy of Medical Sciences English text word segmentation method, device and computer readable medium
CN117095782B (en) * 2023-10-20 2024-02-06 Shanghai Senyi Medical Technology Co., Ltd. Medical text quick input method, system, terminal and editor

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8745093B1 (en) * 2000-09-28 2014-06-03 Intel Corporation Method and apparatus for extracting entity names and their relations
CN108491382A (en) * 2018-03-14 2018-09-04 Sichuan University Semi-supervised biomedical text semantic disambiguation method
CN108664473A (en) * 2018-05-11 2018-10-16 Ping An Technology (Shenzhen) Co., Ltd. Method for recognizing key information in text, electronic device, and readable storage medium
CN108829801A (en) * 2018-06-06 2018-11-16 Dalian University of Technology Event trigger word extraction method based on document-level attention mechanism
CN108831559A (en) * 2018-06-20 2018-11-16 Tsinghua University Chinese electronic health record text analysis method and system
CN109299262A (en) * 2018-10-09 2019-02-01 Sun Yat-sen University Textual entailment recognition method fusing multi-granularity information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
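A semi-supervised method for automatic semantic annotation of 3D models; Shang Fuhua et al.; Computer Engineering and Applications; 20111114 (No. 06); full text *
Research on named entity recognition based on fine-grained word representation; Lin Guanghe et al.; Journal of Chinese Information Processing; 20181115 (No. 11); full text *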

Also Published As

Publication number Publication date
CN110059185A (en) 2019-07-26

Similar Documents

Publication Publication Date Title
CN110059185B (en) Medical document professional vocabulary automatic labeling method
CN110008469B (en) Multilevel named entity recognition method
WO2021139424A1 (en) Text content quality evaluation method, apparatus and device, and storage medium
CN111241294B (en) Relationship extraction method of graph convolution network based on dependency analysis and keywords
CN109753660B (en) LSTM-based winning bid web page named entity extraction method
CN111966917A (en) Event detection and summarization method based on pre-training language model
CN112784051A (en) Patent term extraction method
CN111027595A (en) Double-stage semantic word vector generation method
CN112966525B (en) Law field event extraction method based on pre-training model and convolutional neural network algorithm
CN110781290A (en) Extraction method of structured text abstract of long chapter
CN110750646B (en) Attribute description extracting method for hotel comment text
CN112800184B (en) Short text comment emotion analysis method based on Target-Aspect-Opinion joint extraction
CN115687626A (en) Legal document classification method based on prompt learning fusion key words
CN114386417A (en) Chinese nested named entity recognition method integrated with word boundary information
CN115310448A (en) Chinese named entity recognition method based on combining bert and word vector
CN112101014A (en) Chinese chemical industry document word segmentation method based on mixed feature fusion
CN117010387A (en) Roberta-BiLSTM-CRF voice dialogue text naming entity recognition system integrating attention mechanism
CN115238693A (en) Chinese named entity recognition method based on multi-word segmentation and multi-layer bidirectional long-short term memory
CN112989830B (en) Named entity identification method based on multiple features and machine learning
CN113191150B (en) Multi-feature fusion Chinese medical text named entity identification method
CN117236338B (en) Named entity recognition model of dense entity text and training method thereof
Venkataramana et al. Abstractive text summarization using bart
CN115809666B (en) Named entity recognition method integrating dictionary information and attention mechanism
CN115358227A (en) Open domain relation joint extraction method and system based on phrase enhancement
CN114661912A (en) Knowledge graph construction method, device and equipment based on unsupervised syntactic analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240408

Address after: Room 1518B, Unit 2, 12th Floor, Huizhi Building, No. 9 Xueqing Road, Haidian District, Beijing, 100080

Patentee after: Beijing Contention Technology Co., Ltd.

Country or region after: China

Address before: No. 9, 13th Street, Economic and Technological Development Zone, Binhai New Area, Tianjin

Patentee before: Tianjin University of Science and Technology

Country or region before: China
