CN110059185B - Medical document professional vocabulary automatic labeling method - Google Patents
- Publication number: CN110059185B
- Application number: CN201910265223.3A
- Authority
- CN
- China
- Prior art date: 2019-04-03
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
Abstract
The invention relates to an automatic labeling method for professional vocabulary in medical documents, comprising the following steps: performing data preprocessing on an input medical document to obtain a preprocessed medical document text; acquiring and fusing the letter-level feature vector, word-level feature vector and language feature vector of each word to serve as the word's encoding vector; classifying the word labels of the segmented medical document text to obtain a labeled data set; outputting a multidimensional vector for each word as its spatial representation; acquiring an enhanced annotation data set; and training and modeling, finally outputting the labeling result. The invention is reasonably designed: it uses a semi-supervised learning algorithm to label a large amount of unlabeled data, successfully overcoming the shortage of labeled data in the medical industry, effectively increasing the amount of data available to the model, and greatly improving the algorithm's labeling accuracy on keywords and professional vocabulary; it can be widely used in medical literature processing.
Description
Technical Field
The invention belongs to the technical field of machine learning, and particularly relates to an automatic labeling method for professional vocabularies of medical documents.
Background
With the growth of the medical research community, more and more papers are published every year, and there is an increasing need for methods that automatically extract the key ideas from these articles. However, because the fields involved are so varied and annotation resources are extremely limited, work on scientific information extraction remains relatively rare.
Meanwhile, as demand for medical resources and the corresponding number of medical documents and cases grow, researchers and medical staff need to organize patients' past medical data quickly. Professional vocabulary and keywords often help medical staff make rapid judgments from patient cases, but compiling these terms manually takes considerable time, and manpower limits make it impossible to process large numbers of cases and medical records quickly.
In summary, with rising demand for medical resources, automatically labeling professional vocabulary and keywords, so as to speed up the processing of cases and medical data and help medical personnel treat patients better, is a problem that urgently needs to be solved.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides an automatic labeling method for medical document professional vocabulary that uses a semi-supervised learning algorithm to expand the data volume, overcomes the poor model performance caused by the insufficient amount of labeled data in traditional medical texts, and finally improves the accuracy of recognizing professional vocabulary and keywords in the text.
The technical problem to be solved by the invention is realized by adopting the following technical scheme:
an automatic labeling method for medical document professional vocabularies comprises the following steps:
step 1, performing data preprocessing on an input medical document to obtain a preprocessed medical document text;
step 2, modeling the text using biLSTM to obtain the letter-level feature vector of each word;
step 3, modeling a text by using word2vec to obtain a word-level feature vector of the word;
step 4, obtaining language feature vectors of words based on the language characteristics of the text language;
step 5, fusing the letter-level feature vectors, the word-level feature vectors and the language feature vectors of the words obtained in the step 2, the step 3 and the step 4 to obtain coding vectors of the words;
step 6, labeling the words of the medical document text after word segmentation as the following four types of medical entities: disease name, disease symptom, treatment means and drug name, wherein each entity uses IOBES to represent the specific position of the word within the entity, thereby obtaining a labeled data set;
step 7, taking the text obtained in the step 1 and the encoding vector of the word obtained in the step 5 as the input of the biLSTM, and outputting a multi-dimensional vector as the space representation of the word for each word;
step 8, expanding the labeled data set by using a label propagation algorithm to obtain an enhanced labeled data set;
and step 9, taking the multidimensional vector of step 7 as the spatial representation of each word, inputting it together with the enhanced labeled data set obtained in step 8 into a conditional random field for training and modeling, and finally outputting the labeling result.
Further, the specific implementation method of step 1 is as follows: firstly, segmenting input medical documents to form an array, storing each word and punctuation in the text, then removing stop words, finally extracting word stems and word shapes to restore to obtain basic forms of the words, and forming unmarked word arrays.
Further, the specific implementation method of step 2 is as follows: the letter-level features of the preprocessed medical document text are encoded using biLSTM, encoding the first five letters of each word, finally obtaining a letter-level feature vector of length 5d.
Further, the specific implementation method of step 3 is as follows: the word-level features of the preprocessed medical document text are encoded using Google's Word2Vec algorithm, finally obtaining a word-level feature vector of length d for each word.
Further, the specific implementation method of step 4 is as follows: according to the language features of the text language, a manual definition method is adopted to define the following characteristics for the preprocessed medical document text: the first letter case, all lower case words, all upper case words, part of speech and grammatical structure form a length 21 feature vector, each feature being represented by 0 or 1.
Further, the specific implementation method of step 5 is as follows: the letter-level feature vector, the word-level feature vector and the language feature vector are concatenated together to form, for each word, a comprehensive feature vector of length 5d + d + 21.
Further, the labeled data set of step 6 is a combined label including 20 categories.
Further, the specific implementation method of step 7 is as follows: utilizing the combined feature vector formed by the three features obtained in the step 5, and arranging all feature vectors of the whole word array to form a training data matrix, wherein the number of rows of the matrix is the number of words in the word array, and the number of columns of the matrix is 5d + d +21; using biLSTM, the hidden layer through the forward and backward computation processes is passed as input to the linear layer, which projects the dimensions to the tag type space of size 20 and serves as input to the CRF layer.
Further, the specific implementation method of step 8 is as follows: firstly, a graph is constructed based on the feature vectors corresponding to the words, which serve as the nodes of the graph; the similarity between feature vectors defines the distance between words and the edge weight w_uv, and the total number of nodes in the graph equals the sum of the unlabeled data and the labeled data; then, a label propagation algorithm optimizes an objective function minimizing the Kullback-Leibler distance so that the label distributions of neighboring nodes are as similar to each other as possible, and finally the words corresponding to all nodes in the graph are labeled, giving the enhanced data set.
Further, the specific implementation method of step 9 is as follows: taking the multidimensional spatial representations obtained in step 7 as the word vectors, the biLSTM finally outputs a labeling matrix P containing the probability distribution over each label; this distribution is fed into the CRF layer to obtain a labeling sequence y; the score φ(y; x, θ) of sequence y is calculated, then the probability P_θ(y|x) of the labeling sequence y among all labeling sequences; finally, back propagation is used to maximize the objective function log P_θ(y|x), completing the supervised learning, and the CRF model is output as the final result.
The invention has the advantages and positive effects that:
1. The invention divides the keywords in medical literature into four categories: disease name (disease), symptom (symptom), treatment means (treatment-method) and drug name (Drug-name), and labels medical documents or cases with professional vocabulary based on a semi-supervised labeling method, so that medical staff or students can quickly understand the content of a text at extremely low cost in manpower and materials and make better medical decisions or conduct better research.
2. The invention uses a semi-supervised learning algorithm to label a large amount of unlabeled data, successfully overcoming the shortage of labeled data in the medical industry, effectively increasing the amount of data available to the model, greatly improving the algorithm's labeling accuracy on keywords and professional vocabulary, and can be widely used in medical literature processing.
Drawings
FIG. 1 is a process flow diagram of the present invention.
Detailed Description
The embodiments of the present invention will be described in detail with reference to the accompanying drawings.
The design idea of the invention is as follows: medical documents or cases are labeled with professional vocabulary using machine learning algorithms and a semi-supervised labeling method. The invention constructs a three-layer hierarchical neural network to label text: (1) words in the text undergo vectorized feature extraction in three ways: a biLSTM extracts letter-based features, Word2Vec performs word embedding, and further features are extracted from the grammatical structure; (2) a biLSTM extracts and encodes the contextual information surrounding each word within its sentence; (3) the CRF labeling layer jointly models words and labels using a CRF objective function and makes the final label decision.
Based on the above design concept, the automatic labeling method for medical document professional vocabulary of the invention, as shown in fig. 1, comprises the following steps:
step 1: and carrying out data preprocessing on the input medical document to obtain a preprocessed medical document text.
In this step, the input is a medical document and the output is an array of words. The data preprocessing proceeds as follows: segment the medical document into an array storing each word and punctuation mark of the text; remove stop words such as is, but, shall and by; and finally perform stemming and lemmatization to recover the base form of each word. For example, run, ran and runs all yield run after stemming; lemmatization works similarly, reducing any inflected form of a word to a common base form. Preprocessing thus yields an unlabeled word array of base forms.
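The preprocessing pipeline above can be sketched in a few lines; the stop-word list and the `crude_stem` suffix-stripper below are simplified stand-ins for a full stop-word list and a real stemmer/lemmatizer (e.g. Porter stemming plus WordNet lemmatization), not the patent's actual implementation:

```python
import re

# Small illustrative stop-word list; a real pipeline would use a full one.
STOP_WORDS = {"is", "but", "shall", "by", "the", "a", "an", "of", "and"}

def crude_stem(word):
    """Very rough suffix stripping, standing in for stemming/lemmatization."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(document):
    """Segment into words and punctuation, drop stop words, reduce to base forms."""
    tokens = re.findall(r"[A-Za-z]+|[^\sA-Za-z]", document)
    return [crude_stem(t.lower()) for t in tokens if t.lower() not in STOP_WORDS]

print(preprocess("The patient is running a fever."))
# ['patient', 'runn', 'fever', '.']
```

The crude stemmer over-strips ("running" becomes "runn"), which is exactly why real systems pair stemming with dictionary-based lemmatization as the text describes.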
Step 2: the text is modeled using BilSTM, resulting in letter-level feature vectors for words.
The input of the step is a word array after data preprocessing, the output is a characteristic vector based on letter characteristics, and the length is 5d.
The present invention uses a biLSTM to encode letter features, called character-based embedding. The letter features of a word are generated by the hidden-layer vectors of the biLSTM's forward and backward passes; the advantage of building a character-based embedding layer is that it can handle out-of-vocabulary words and formulas, which are common in these data. The generated feature length per position is set to d. The invention adopts the head 5-gram (i.e. the first 5 letters of the word are encoded from left to right; if the word has fewer than 5 letters, the remaining length is zero-padded), so the final feature vector length is 5d.
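The head 5-gram encoding can be illustrated with a plain bidirectional tanh-RNN in NumPy standing in for the trained biLSTM; the weights here are random (untrained) and d = 8 is an arbitrary demo size, so only the shapes, not the values, are meaningful:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # per-position feature size d (arbitrary demo value)

# Random character embeddings and recurrent weights; a real system trains these.
char_emb = rng.normal(size=(128, D))
W_in, W_h = rng.normal(size=(D, D // 2)), rng.normal(size=(D // 2, D // 2))

def rnn_pass(xs):
    """One directional tanh-RNN pass, returning the hidden state at each step."""
    h, out = np.zeros(D // 2), []
    for x in xs:
        h = np.tanh(x @ W_in + h @ W_h)
        out.append(h)
    return out

def char_features(word):
    """Head 5-gram: embed the first 5 letters (zero-padded on the right),
    run forward and backward passes, and concatenate per position."""
    ids = [min(ord(c), 127) for c in word[:5].ljust(5, "\0")]
    xs = [char_emb[i] for i in ids]
    fwd = rnn_pass(xs)
    bwd = rnn_pass(xs[::-1])[::-1]
    return np.concatenate([np.concatenate([f, b]) for f, b in zip(fwd, bwd)])

print(char_features("fever").shape)  # (40,), i.e. 5 * d with d = 8
```

Each of the 5 positions contributes a forward and a backward half of size d/2, so the flattened output has the 5d length described above.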
Step 3: the text is modeled using word2vec, producing word-level feature vectors for words.
In this step, words from a fixed vocabulary (plus an unknown-word token) are mapped to a vector space, initialized with Word2Vec pre-training on different corpus combinations. The words are encoded using Google's Word2Vec algorithm, called word embedding, finally yielding a feature vector of length d for each word.
Step 4: design language feature vectors of words based on the linguistic characteristics of the text language.
In this step, the input is the word array obtained by segmenting the original text only, and the output is a feature vector designed from language characteristics, of length 21. These features are not trained separately but defined manually, and are called feature embedding. The features defined here comprise: whether the first letter is capitalized, whether the word is all lowercase, whether it is all uppercase, its part of speech and its grammatical structure, 21 features in total; the resulting feature vector has length 21, each feature represented by 0 or 1 to indicate whether it is present.
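A sketch of such manually defined binary features follows; the patent does not enumerate all 21 features, so the coarse POS tag set below is an assumption introduced only to illustrate the 0/1 encoding, and the resulting vector is shorter than 21:

```python
# Assumed coarse POS tag set, for illustration only; the patent says only
# that part-of-speech and grammatical-structure indicators fill the remaining slots.
POS_TAGS = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "ADP", "NUM", "OTHER"]

def language_features(word, pos):
    """3 word-shape indicators plus a one-hot POS indicator, each 0 or 1."""
    shape = [
        int(word[:1].isupper()),   # first letter capitalized
        int(word.islower()),       # all lowercase
        int(word.isupper()),       # all uppercase
    ]
    pos_onehot = [int(pos == t) for t in POS_TAGS]
    return shape + pos_onehot

vec = language_features("Fever", "NOUN")
print(len(vec), vec[:3])  # 12 [1, 0, 0]
```

With a richer tag and grammatical-structure inventory, the same construction reaches the 21 binary features the text describes.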
Step 5: fuse the letter-level features, word-level features and language features of the words obtained in steps 2, 3 and 4 to obtain the encoding vectors of the words.
The input of this step is the letter-level feature vector, the word-level feature vector and the language feature vector of each word; the three feature vectors are concatenated to form, for each word, a comprehensive feature vector of length 5d + d + 21.
Step 6: labeling data: and marking words of the segmented electronic medical record text as four types of medical entities (diseases, symptoms, treatment means and medicine names), wherein each type of entity represents the specific position of the word in the entity by IOBES and is marked as 20 types in total to obtain a marked data set.
In order to distinguish the spans of two consecutive key phrases of the same type, the present invention assigns a tag to each word in the sentence, specifying its position and type within the phrase. On the basis of the preprocessed data, each word is labeled with a tag representing the position of the phrase in which it occurs and its category. The phrase position is labeled first, uniformly using IOBES (Inside, Outside, Beginning, End, Singleton) to describe the position of a word within a professional phrase or vocabulary item: I indicates that the word is inside the phrase, B that it begins the phrase, E that it ends the phrase, S that it is a single-word professional term, and O that the word is outside any phrase. In the present invention, these professional words and phrases are labeled with the categories disease name (disease), symptom (symptom), treatment means (treatment-method) and drug name (Drug-name), combined to form a complete label; for example, a three-word symptom phrase is labeled "B-symptom I-symptom E-symptom". The combined tags thus formed have 20 categories in total. Because the training set is very large, only part of the data is labeled, forming the labeled and unlabeled portions of the data set.
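The IOBES position scheme combined with the four categories can be illustrated with a small helper; the `iobes_tags` function and its `(start, end, category)` span format are hypothetical, introduced only for this example:

```python
def iobes_tags(tokens, spans):
    """Assign combined IOBES-category tags. Each span is a (start, end, category)
    token range (end exclusive) for one of the four entity types."""
    tags = ["O"] * len(tokens)
    for start, end, cat in spans:
        if end - start == 1:
            tags[start] = f"S-{cat}"          # single-word term
        else:
            tags[start] = f"B-{cat}"          # phrase beginning
            tags[end - 1] = f"E-{cat}"        # phrase end
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{cat}"          # phrase interior
    return tags

tokens = ["critically", "ill", "patients", "received", "aspirin"]
print(iobes_tags(tokens, [(0, 3, "symptom"), (4, 5, "Drug-name")]))
# ['B-symptom', 'I-symptom', 'E-symptom', 'O', 'S-Drug-name']
```

Because adjacent phrases of the same category get their own B/E boundaries, two consecutive key phrases remain distinguishable, which is the motivation stated above.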
Step 7: take the text preprocessed in step 1 and the word encoding vectors from step 5 as the input of the biLSTM, set the output size to 20, and output a 20-dimensional vector for each word as its spatial representation.
In the step, a combined feature vector formed by the three features obtained in the step 5 is utilized, all feature vectors of the whole word array are arranged to form a training data matrix, the number of rows of the matrix is the number of words in the word array, and the number of columns of the matrix is 5d + d + 21.
Step 8: perform data enhancement on the labeled data set obtained in step 6, expanding it with a label propagation algorithm to obtain the enhanced labeled data set.
The method comprises two parts. The first part constructs a graph based on the feature vectors corresponding to the words, which serve as the nodes of the graph; the similarity between feature vectors defines their distance and the edge weight w_uv, and the total number of nodes equals the sum of the unlabeled and labeled data. The second part uses a label propagation algorithm, which aims to make the label distributions of neighboring nodes as similar to each other as possible by optimizing an objective function minimizing the Kullback-Leibler distance. Finally the words corresponding to all nodes in the graph are labeled, giving the enhanced data set. The specific method is as follows:
The first part constructs the relation graph required by the label propagation algorithm: vertices correspond to the feature vectors of words, and edges carry distances between word features to capture semantic similarity. The total size of the graph equals the amount of labeled data V_l plus the amount of unlabeled data V_u. Nodes are modeled with a set of pre-trained word embeddings (dimension d), the head 5-gram embedding of the first 5 letters of the current word, the word closest to the verb, and one-hot part-of-speech and case indicators (43- and 4-dimensional one-hot vectors). The resulting feature vector of length 5d + d + 43 + 4 is then projected to 100 dimensions using the PCA dimension-reduction algorithm. The invention defines the weight w_uv of the edge between nodes u and v as: w_uv = d_e(u, v) if v ∈ κ(u) or u ∈ κ(v), where κ(u) is the set of k nearest neighbors of u and d_e(u, v) is the Euclidean distance between any two nodes u and v in the graph.
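The graph construction of this first part can be sketched as follows, with random vectors standing in for the PCA-reduced 100-dimensional word features; per the definition above, the weight w_uv is the Euclidean distance d_e(u, v) whenever u is among v's k nearest neighbours or vice versa:

```python
import numpy as np

rng = np.random.default_rng(1)

def knn_graph(features, k=3):
    """Symmetrized k-NN graph: w_uv = d_e(u, v) if v is among u's k nearest
    neighbours or u is among v's; otherwise 0 (no edge)."""
    n = len(features)
    dist = np.linalg.norm(features[:, None] - features[None, :], axis=-1)
    w = np.zeros((n, n))
    for u in range(n):
        neighbours = np.argsort(dist[u])[1 : k + 1]  # index 0 is u itself
        w[u, neighbours] = dist[u, neighbours]
    return np.maximum(w, w.T)  # symmetrize: keep edge if either side chose it

feats = rng.normal(size=(10, 100))  # stand-in for PCA-reduced word features
w = knn_graph(feats)
print(w.shape)
```

Note the patent uses the raw distance as the edge weight; many propagation variants instead use a decreasing similarity (e.g. a Gaussian kernel of the distance), which would be a one-line change here.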
For each node i in the graph, the invention computes the marginal probabilities {q_i} using a forward and backward computation. Let θ_n denote the estimate of the CRF parameters after the n-th iteration; with it, the invention computes the marginal probability over the IOBES labels at each position i of sentence j, for both the labeled and the unlabeled data.
The second part enhances the data: unlabeled data in the data set are annotated with a label propagation algorithm that makes the label distributions of neighboring nodes as similar as possible by minimizing the following Kullback-Leibler distances: i) for all labeled word nodes in the graph, between the empirical label distribution r_u and the predicted label distribution q_u; ii) for every node u and each of its neighbors v, between the distributions q_u and q_v; iii) for all nodes, between the distribution q_u and the CRF marginal probabilities. If a node is not connected to a labeled vertex, the third term regularizes its predicted distribution toward the CRF prediction, ensuring the algorithm performs at least as well as standard self-training.
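A toy version of this propagation step follows; note it uses plain weighted averaging with clamped seed labels rather than the patent's Kullback-Leibler objective and CRF regularizer, so it is only an illustrative stand-in for the actual algorithm:

```python
import numpy as np

def propagate(w, labels, n_labels, iters=50):
    """Simplified label propagation: seed distributions on labeled nodes,
    repeatedly average neighbours' distributions, re-clamping the seeds."""
    n = len(w)
    q = np.full((n, n_labels), 1.0 / n_labels)  # start from uniform
    for u, l in labels.items():
        q[u] = np.eye(n_labels)[l]
    for _ in range(iters):
        deg = w.sum(axis=1, keepdims=True)
        q = np.where(deg > 0, (w @ q) / np.maximum(deg, 1e-12), q)
        for u, l in labels.items():             # clamp labeled nodes
            q[u] = np.eye(n_labels)[l]
    return q.argmax(axis=1)

# Tiny chain graph 0-1-2-3: node 0 labeled 0, node 3 labeled 1.
w = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
print(propagate(w, {0: 0, 3: 1}, 2))  # [0 0 1 1]
```

Each unlabeled node ends up with the label of the nearer seed, which is the behaviour the KL-based objective also encourages between neighbouring nodes.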
Step 9: take the 20-dimensional spatial representations of the words from step 7 as the word vectors, and input the enhanced labeled data set obtained in step 8 into a conditional random field for training and modeling.
Taking the 20-dimensional spatial representations of the words from step 7 as the word vectors, the biLSTM finally outputs a labeling matrix P containing the probability distribution over each label; this distribution is then fed into the CRF layer to obtain a labeling sequence y; the score φ(y; x, θ) of sequence y is calculated, then the probability P_θ(y|x) of the labeling sequence y among all labeling sequences; finally, back propagation is used to maximize the objective function log P_θ(y|x), completing the supervised learning, and the CRF model is also output as the final result. The specific method is as follows:
keyword classification is a task where there is a strong dependency between output labels, e.g., I-distance cannot be followed by B-stream-method, so the present invention does not make independent labeling decisions for each output, but jointly models them using conditional random fields. For the input sentence x = (x 1, x2, x 3., xn), we consider P to be the score matrix of the biLSTM network output. The size of P is n × m, where n is the number of tokens in the sentence and m is the number of different tokens. P t,i The score of the ith label corresponding to the first word to death in the sentence. The invention uses a first order Markov model and defines a transformation matrix T, where T i,j The table scores from label i to label j. The invention also increases y 0 And y n As start and end identifiers. Thus the dimension of T matrix becomesm+2。
Given a possible output y and the neural network parameters θ, the invention defines the score as
φ(y; x, θ) = ∑_{t=1}^{n} (T_{y_{t−1}, y_t} + P_{t, y_t})
The probability of sequence y is obtained by applying softmax on all possible tag sequences:
P_θ(y|x) = exp(φ(y; x, θ)) / ∑_{y′ ∈ Y} exp(φ(y′; x, θ))
where Y represents all possible tag sequences. The normalization term can be computed efficiently using a forward algorithm.
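The softmax over all tag sequences and the forward-algorithm normalization can be checked numerically on small random scores; `log_partition` and `seq_score` are illustrative names, and the start/end transition terms are omitted for brevity:

```python
import itertools

import numpy as np

def seq_score(P, T, y):
    """phi(y; x, theta): emission scores plus transition scores for sequence y."""
    y = np.asarray(y)
    return P[np.arange(len(y)), y].sum() + T[y[:-1], y[1:]].sum()

def log_partition(P, T):
    """Forward algorithm: log of the sum of exp(score) over all tag sequences."""
    alpha = P[0].copy()
    for t in range(1, len(P)):
        scores = alpha[:, None] + T + P[t][None, :]
        m = scores.max(axis=0)
        alpha = m + np.log(np.exp(scores - m).sum(axis=0))  # logsumexp per tag
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())

rng = np.random.default_rng(2)
P, T = rng.normal(size=(4, 3)), rng.normal(size=(3, 3))  # 4 words, 3 tags
logZ = log_partition(P, T)
total = sum(np.exp(seq_score(P, T, y) - logZ)
            for y in itertools.product(range(3), repeat=4))
print(round(total, 6))  # 1.0: the probabilities of all sequences sum to 1
```

The brute-force sum over all 3^4 sequences matching the forward-algorithm normalizer is exactly the efficiency claim made in the text: the forward pass computes in O(n·m²) what enumeration computes in O(mⁿ).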
Finally, the labeled data in the data set are used for preliminary training, during which the invention maximizes the log-probability L(Y; X, θ) of the correct label sequence for the given corpus {X, Y}, with back propagation based on a gradient computed from the total score of the sentence.
After the trained CRF algorithm is obtained, it is combined with the feature extraction part constructed above, and the text can then be labeled: inputting a sentence x = (x_1, x_2, x_3, ..., x_n) yields a labeled sequence y = (y_1, y_2, y_3, ..., y_n).
nothing in this specification is said to apply to the prior art.
It should be emphasized that the embodiments described herein are illustrative rather than restrictive, and thus the present invention is not limited to the embodiments described in the detailed description, but also includes other embodiments that can be derived from the technical solutions of the present invention by those skilled in the art.
Claims (9)
1. A medical document professional vocabulary automatic labeling method is characterized by comprising the following steps:
step 1, performing data preprocessing on an input medical document to obtain a preprocessed medical document text;
step 2, modeling the text using biLSTM to obtain the letter-level feature vector of each word;
step 3, modeling a text by using word2vec to obtain a word-level feature vector of the word;
step 4, obtaining language feature vectors of words based on the language characteristics of the text language;
step 5, fusing the letter-level feature vectors, the word-level feature vectors and the language feature vectors of the words obtained in the step 2, the step 3 and the step 4 to obtain coding vectors of the words;
and 6, marking the words of the medical document text after word segmentation as the following four medical entities: disease name, disease symptom, treatment means and drug name, wherein each entity uses IOBES to represent the concrete position of the word in the entity to obtain a labeled data set;
step 7, taking the text obtained in the step 1 and the encoding vector of the word obtained in the step 5 as the input of the biLSTM, and outputting a multi-dimensional vector as the space representation of the word for each word;
step 8, expanding the labeled data set by using a label propagation algorithm to obtain an enhanced labeled data set;
and 9, inputting the multidimensional vector obtained in the step 7 and the enhanced labeling data set obtained in the step 8 into a conditional random field for training and modeling, and finally outputting a labeling result.
2. The automatic labeling method for the professional vocabulary of the medical document according to claim 1, wherein: the specific implementation method of the step 1 comprises the following steps: firstly, segmenting the input medical document to form an array, storing each word and punctuation marks in the text, then removing stop words, finally extracting word stems and word shapes to restore to obtain the basic form of the words, and forming the unmarked word array.
3. The automatic labeling method for the professional vocabularies of the medical documents as claimed in claim 1, wherein the method comprises the following steps: the specific implementation method of the step 2 comprises the following steps: the letter-level features of the preprocessed medical document text are encoded using biLSTM, encoding the first five letters of each word, finally obtaining a letter-level feature vector of length 5d.
4. The automatic labeling method for the professional vocabularies of the medical documents as claimed in claim 1, wherein the method comprises the following steps: the specific implementation method of the step 3 comprises the following steps: the word-level features of the preprocessed medical document text are encoded using Google's Word2Vec algorithm, finally obtaining a word-level feature vector of length d for each word.
5. The automatic labeling method for the professional vocabularies of the medical documents as claimed in claim 1, wherein the method comprises the following steps: the specific implementation method of the step 4 comprises the following steps: according to the language features of the text language, a manual definition method is adopted to define the following characteristics for the preprocessed medical document text: the first letter case, all lower case words, all upper case words, part of speech and grammatical structure form a length 21 feature vector, each feature being represented by 0 or 1.
6. The automatic labeling method for the professional vocabularies of the medical documents as claimed in claim 1, wherein the method comprises the following steps: the specific implementation method of the step 5 is as follows: the letter-level feature vector, the word-level feature vector and the language feature vector are concatenated together to form, for each word, a comprehensive feature vector of length 5d + d + 21.
7. The automatic labeling method for the professional vocabularies of the medical documents as claimed in claim 1, wherein the method comprises the following steps: the labeled data set of step 6 is a combined label comprising 20 categories.
8. The automatic labeling method for the professional vocabularies of the medical documents as claimed in claim 1, wherein the method comprises the following steps: the specific implementation method of the step 7 is as follows: utilizing the combined feature vector formed by the three features obtained in the step 5, all feature vectors of the whole word array are arranged to form a training data matrix, wherein the number of rows of the matrix is the number of words in the word array, and the number of columns of the matrix is 5d + d + 21; using biLSTM, the hidden layers from the forward and backward computation passes are passed as input to a linear layer, which projects the dimensions to the tag type space of size 20 and serves as input to the CRF layer.
9. The method for automatically labeling professional vocabulary in medical documents according to claim 1, wherein step 8 is specifically implemented as follows: first, a graph is constructed from the feature vectors of the words, with the words as nodes and the edge weight w_uv between nodes defined by the similarity between their feature vectors; the total number of nodes in the graph equals the sum of the unlabeled and labeled data. Then, a label propagation algorithm optimizes an objective that minimizes the Kullback-Leibler divergence, so that the label distributions of adjacent nodes become as similar as possible; finally, the words corresponding to all nodes in the graph are labeled, yielding an enhanced data set.
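A toy sketch of the propagation idea in step 8 (assumed structure, not the patent's exact algorithm): words are graph nodes, edge weights w_uv come from feature-vector similarity, and each unlabeled node repeatedly takes the weighted average of its neighbors' label distributions. The patent optimizes a Kullback-Leibler objective; plain weighted averaging is used here only to show how labels spread from labeled to unlabeled nodes:

```python
# Label propagation toy example: labeled nodes 0 and 1 stay clamped, and the
# unlabeled nodes 2 and 3 absorb their neighbors' label distributions through
# similarity-weighted averaging until the distributions stabilize.
dist = {
    0: [1.0, 0.0],   # labeled node, class 0
    1: [0.0, 1.0],   # labeled node, class 1
    2: [0.5, 0.5],   # unlabeled
    3: [0.5, 0.5],   # unlabeled
}
adj = {              # w_uv: similarity-based edge weights (hypothetical)
    2: {0: 0.9, 1: 0.1, 3: 0.3},
    3: {0: 0.2, 1: 0.8, 2: 0.3},
}
for _ in range(50):                       # iterate until practically stable
    for u in (2, 3):                      # labeled nodes are never updated
        total = sum(adj[u].values())
        dist[u] = [sum(wt * dist[v][k] for v, wt in adj[u].items()) / total
                   for k in range(2)]
print([round(p, 2) for p in dist[2]])     # [0.77, 0.23] -> node 2 gets class 0
```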
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910265223.3A CN110059185B (en) | 2019-04-03 | 2019-04-03 | Medical document professional vocabulary automatic labeling method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910265223.3A CN110059185B (en) | 2019-04-03 | 2019-04-03 | Medical document professional vocabulary automatic labeling method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110059185A CN110059185A (en) | 2019-07-26 |
CN110059185B true CN110059185B (en) | 2022-10-04 |
Family
ID=67318293
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910265223.3A Active CN110059185B (en) | 2019-04-03 | 2019-04-03 | Medical document professional vocabulary automatic labeling method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110059185B (en) |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111063446B (en) * | 2019-12-17 | 2023-06-16 | 医渡云(北京)技术有限公司 | Method, apparatus, device and storage medium for standardizing medical text data |
WO2021142534A1 (en) * | 2020-01-13 | 2021-07-22 | Knowtions Research Inc. | Method and system of using hierarchical vectorisation for representation of healthcare data |
CN111666406B (en) * | 2020-04-13 | 2023-03-31 | 天津科技大学 | Short text classification prediction method based on word and label combination of self-attention |
CN111651991B (en) * | 2020-04-15 | 2022-08-26 | 天津科技大学 | Medical named entity identification method utilizing multi-model fusion strategy |
CN111797612A (en) * | 2020-05-15 | 2020-10-20 | 中国科学院软件研究所 | Method for extracting automatic data function items |
CN111797263A (en) * | 2020-07-08 | 2020-10-20 | 北京字节跳动网络技术有限公司 | Image label generation method, device, equipment and computer readable medium |
CN112101014B (en) * | 2020-08-20 | 2022-07-26 | 淮阴工学院 | Chinese chemical industry document word segmentation method based on mixed feature fusion |
CN113808752A (en) * | 2020-12-04 | 2021-12-17 | 四川医枢科技股份有限公司 | Medical document identification method, device and equipment |
CN113297852B (en) * | 2021-07-26 | 2021-11-12 | 北京惠每云科技有限公司 | Medical entity word recognition method and device |
CN114386424B (en) * | 2022-03-24 | 2022-06-10 | 上海帜讯信息技术股份有限公司 | Method, apparatus, terminal, and storage medium for automatic labeling of industry professional text |
CN115563311B (en) * | 2022-10-21 | 2023-09-15 | 中国能源建设集团广东省电力设计研究院有限公司 | Document labeling and knowledge base management method and knowledge base management system |
CN115858819B (en) * | 2023-01-29 | 2023-05-16 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | Sample data amplification method and device |
CN117034917B (en) * | 2023-10-08 | 2023-12-22 | 中国医学科学院医学信息研究所 | English text word segmentation method, device and computer readable medium |
CN117095782B (en) * | 2023-10-20 | 2024-02-06 | 上海森亿医疗科技有限公司 | Medical text quick input method, system, terminal and editor |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8745093B1 (en) * | 2000-09-28 | 2014-06-03 | Intel Corporation | Method and apparatus for extracting entity names and their relations |
CN108491382A (en) * | 2018-03-14 | 2018-09-04 | Sichuan University | Semi-supervised biomedical text semantic disambiguation method |
CN108664473A (en) * | 2018-05-11 | 2018-10-16 | Ping An Technology (Shenzhen) Co., Ltd. | Method for recognizing key information in text, electronic device, and readable storage medium |
CN108829801A (en) * | 2018-06-06 | 2018-11-16 | Dalian University of Technology | Event trigger word extraction method based on a document-level attention mechanism |
CN108831559A (en) * | 2018-06-20 | 2018-11-16 | Tsinghua University | Chinese electronic health record text analysis method and system |
CN109299262A (en) * | 2018-10-09 | 2019-02-01 | Sun Yat-sen University | Text entailment relation recognition method fusing multi-granularity information |
Non-Patent Citations (2)
Title |
---|
A semi-supervised automatic semantic labeling method for 3D models; Shang Fuhua et al.; Computer Engineering and Applications; 2011-11-14 (Issue 06); full text *
Research on named entity recognition based on fine-grained word representations; Lin Guanghe et al.; Journal of Chinese Information Processing; 2018-11-15 (Issue 11); full text *
Also Published As
Publication number | Publication date |
---|---|
CN110059185A (en) | 2019-07-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110059185B (en) | Medical document professional vocabulary automatic labeling method | |
CN110008469B (en) | Multilevel named entity recognition method | |
WO2021139424A1 (en) | Text content quality evaluation method, apparatus and device, and storage medium | |
CN111241294B (en) | Relationship extraction method of graph convolution network based on dependency analysis and keywords | |
CN109753660B (en) | LSTM-based winning bid web page named entity extraction method | |
CN111966917A (en) | Event detection and summarization method based on pre-training language model | |
CN112784051A (en) | Patent term extraction method | |
CN111027595A (en) | Double-stage semantic word vector generation method | |
CN112966525B (en) | Law field event extraction method based on pre-training model and convolutional neural network algorithm | |
CN110781290A (en) | Extraction method of structured text abstract of long chapter | |
CN110750646B (en) | Attribute description extracting method for hotel comment text | |
CN112800184B (en) | Short text comment emotion analysis method based on Target-Aspect-Opinion joint extraction | |
CN115687626A (en) | Legal document classification method based on prompt learning fusion key words | |
CN114386417A (en) | Chinese nested named entity recognition method integrated with word boundary information | |
CN115310448A (en) | Chinese named entity recognition method based on combining BERT and word vectors | |
CN112101014A (en) | Chinese chemical industry document word segmentation method based on mixed feature fusion | |
CN117010387A (en) | Roberta-BiLSTM-CRF voice dialogue text naming entity recognition system integrating attention mechanism | |
CN115238693A (en) | Chinese named entity recognition method based on multi-word segmentation and multi-layer bidirectional long-short term memory | |
CN112989830B (en) | Named entity identification method based on multiple features and machine learning | |
CN113191150B (en) | Multi-feature fusion Chinese medical text named entity identification method | |
CN117236338B (en) | Named entity recognition model of dense entity text and training method thereof | |
Venkataramana et al. | Abstractive text summarization using bart | |
CN115809666B (en) | Named entity recognition method integrating dictionary information and attention mechanism | |
CN115358227A (en) | Open domain relation joint extraction method and system based on phrase enhancement | |
CN114661912A (en) | Knowledge graph construction method, device and equipment based on unsupervised syntactic analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right |
Effective date of registration: 2024-04-08
Address after: Room 1518B, Unit 2, 12th Floor, Huizhi Building, No. 9 Xueqing Road, Haidian District, Beijing, 100080
Patentee after: Beijing contention Technology Co.,Ltd.
Country or region after: China
Address before: No. 9, 13th Street, Economic and Technological Development Zone, Binhai New Area, Tianjin
Patentee before: TIANJIN UNIVERSITY OF SCIENCE AND TECHNOLOGY
Country or region before: China