CN114357108A - Medical text classification method based on semantic template and language model - Google Patents

Medical text classification method based on semantic template and language model

Info

Publication number
CN114357108A
Authority
CN
China
Prior art keywords
language
template
training
text
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111412869.3A
Other languages
Chinese (zh)
Inventor
侯聪
唐文瀚
余海东
肖茂
许瑞玲
王俊
蔡冲
夏凯
陈佳林
白良俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Daguan Data Chengdu Co ltd
Original Assignee
Daguan Data Chengdu Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Daguan Data Chengdu Co ltd
Priority to CN202111412869.3A
Publication of CN114357108A
Legal status: Pending (current)

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a medical text classification method based on a semantic template and a language model, comprising the following steps: S1: prepare a corpus by collecting medical-related text from public sources; S2: train a language model based on a self-attention neural network architecture on the corpus obtained in step S1; S3: design a language template; S4: train the classification task; S5: input the text to be classified, predict the result, and confirm its category. In a medical auditing scenario, a pre-trained semantic model with strong generalization ability, built from a large amount of unstructured data, can be applied to settings with little or even no labeled data, reducing the need for annotation and avoiding overfitting.

Description

Medical text classification method based on semantic template and language model
Technical Field
The application relates to the field of natural language processing algorithms, in particular to a medical text classification method based on a semantic template and a language model.
Background
During auditing, medical texts such as prescriptions sometimes need to be classified, and given the volume of data, automatic classification by a model is the only practical choice. A medical text classification system is a text classification application in which a model automatically assigns the input text to a category.
Document classification in the medical field often suffers from a shortage of labeled data: in a given application scenario, some categories have only a few labeled documents or none at all. In open-source domains, however, a large amount of related text is available.
Common supervised classification models, such as fastText, TextCNN, and BERT-based text classifiers, all follow the same pattern: training data is collected according to a classification taxonomy, a model is trained, and the trained model then predicts document categories. This approach works when labeled data is plentiful; however, applying it directly to document classification in specialized fields where annotated data is scarce has the following disadvantages:
1. With little data, the model easily overfits, which reduces its generalization ability and degrades prediction quality;
2. When no labeled samples exist, the model cannot be trained at all and no prediction is possible.
Disclosure of Invention
To remedy these shortcomings of the prior art, the invention provides a medical text classification method based on a semantic template and a language model. In a medical auditing scenario, it applies a pre-trained semantic model with strong generalization ability, built from a large amount of unstructured data, to settings with little or even no labeled data, reducing the need for annotation and avoiding overfitting.
In order to achieve the above object, the present invention provides the following technical solution:
A medical text classification method based on semantic templates and language models comprises the following steps:
S1: prepare a corpus by collecting medical-related text from public sources;
S2: train a language model based on a self-attention neural network architecture on the corpus obtained in step S1;
S3: design a language template;
S4: train the classification task;
S5: input the text to be classified, predict the result, and confirm its category.
Further, in step S1, the public sources include Wikipedia, medical textbooks, drug manuals, papers, and periodicals.
Further, in step S3, the language template is designed through the following steps:
T1: design the template wording according to the task content and the common language habits of the medical field, and reserve an answer space, obtaining a seed template;
T2: generate synonymous paraphrases of the seed template through synonym replacement, back-translation, and similar-sentence generation, forming a candidate template library;
T3: combine the templates in the candidate template library with the answers in the answer space, feed the combinations into the language model of step S2 for scoring, and select the template with the highest average score.
Further, in step S4, the classification task is trained through the following steps:
Q1: convert the classification annotation data into template training data using the language template obtained in step S3;
Q2: feed the converted annotation data into the language model of step S2 for fine-tuning.
Further, in step Q1, the classification annotation data is converted into template training data by splicing the condition text in the annotation data with the language template obtained in step S3; during training, the spliced text serves as the input to the language model of step S2, and the label in the annotation data serves as the training output.
Further, in step S5, result prediction is performed by splicing the text to be classified with the language template obtained in step S3, feeding the spliced text into the language model fine-tuned in step S4, and selecting the answer with the highest probability in the answer space as the prediction output, thereby confirming the specific category of the text.
By virtue of the above technical solution, the invention has the following advantages:
1. The general semantic encoder is trained on a large amount of general-domain data and encodes semantics effectively; even without labeled data, direct prediction with a suitable template reaches a useful level of accuracy, so no training data is required.
2. When only a small amount of labeled data is available, the generalization ability of the pre-trained language model reduces the risk of overfitting.
Detailed Description
The invention is described in further detail below with reference to a specific example so that its construction and manner of use can be clearly understood; the scope of protection of this patent is not limited thereby.
A medical text classification method based on semantic templates and language models comprises the following steps:
S1: prepare a corpus by collecting medical-related text from public sources;
S2: train a language model based on a self-attention neural network architecture on the corpus obtained in step S1;
S3: design a language template;
S4: train the classification task;
S5: input the text to be classified, predict the result, and confirm its category.
Further, in step S1, the public sources include Wikipedia, medical textbooks, drug manuals, papers, and periodicals.
Further, in step S3, the language template is designed through the following steps:
T1: design the template wording according to the task content and the common language habits of the medical field, and reserve an answer space, obtaining a seed template;
T2: generate synonymous paraphrases of the seed template through synonym replacement, back-translation, and similar-sentence generation, forming a candidate template library;
T3: combine the templates in the candidate template library with the answers in the answer space, feed the combinations into the language model of step S2 for scoring, and select the template with the highest average score.
Further, in step S4, the classification task is trained through the following steps:
Q1: convert the classification annotation data into template training data using the language template obtained in step S3;
Q2: feed the converted annotation data into the language model of step S2 for fine-tuning.
Further, in step Q1, the classification annotation data is converted into template training data by splicing the condition text in the annotation data with the language template obtained in step S3; during training, the spliced text serves as the input to the language model of step S2, and the label in the annotation data serves as the training output.
Further, in step S5, result prediction is performed by splicing the text to be classified with the language template obtained in step S3, feeding the spliced text into the language model fine-tuned in step S4, and selecting the answer with the highest probability in the answer space as the prediction output, thereby confirming the specific category of the text.
Example one
In auditing, a large number of prescription texts must be classified: given the prescription content, determine which disease category the diagnosis mainly belongs to. The system holds a small amount of labeled data, but its distribution is uneven, and many less common diseases have no annotated data at all.
Preparing a corpus: a large amount of medical-domain text is crawled from the web, including Wikipedia, medical textbooks, drug manuals, and so on.
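The patent does not prescribe how the crawling is done; the following Python sketch shows one common approach using requests and BeautifulSoup, where the URL list, the output file name, and the length filter are placeholder assumptions.

```python
# Minimal corpus-collection sketch (assumptions: the URLs are placeholders,
# and real crawling must respect each site's terms of use and robots.txt).
import requests
from bs4 import BeautifulSoup

SOURCE_URLS = [
    "https://example.org/medical-article-1",   # placeholder URLs only
    "https://example.org/drug-manual-entry",
]

with open("medical_corpus.txt", "w", encoding="utf-8") as corpus:
    for url in SOURCE_URLS:
        html = requests.get(url, timeout=10).text
        # Strip markup and keep only reasonably long text lines.
        text = BeautifulSoup(html, "html.parser").get_text(separator="\n")
        for line in text.splitlines():
            line = line.strip()
            if len(line) > 20:
                corpus.write(line + "\n")
```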
Training a language model: a BERT-based language model is trained on this corpus.
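A minimal sketch of this step, assuming a HuggingFace Transformers masked-language-model setup: initializing from a public checkpoint rather than training from scratch, the corpus file name (matching the crawling sketch above), and the hyperparameters are illustrative assumptions, not details given in the patent.

```python
# Domain-adaptive masked-language-model training sketch (all names and
# hyperparameters below are illustrative assumptions).
from datasets import load_dataset
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")

# medical_corpus.txt: one sentence or paragraph of crawled medical text per line.
corpus = load_dataset("text", data_files={"train": "medical_corpus.txt"})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens; the model learns to recover them.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="medical-bert", num_train_epochs=1,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()
```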
Template selection: 1. Drawing on general domain knowledge, a set of seed templates is written, such as "diagnosis xxx according to the above", "diagnosis based on the above", "patient has xxx", and the like, where xxx marks the reserved answer slot.
2. Synonymous paraphrases of the seed templates are generated through synonym replacement, back-translation, and similar-sentence generation, yielding a batch of candidate templates.
3. Based on professional knowledge and project requirements, the answer candidate space is defined, namely disease categories such as kidney disease, digestive disease, and so on.
4. Each candidate template is combined with the answers, fed into the domain language model obtained in the language-model training step, and the average probability is computed; the template with the highest score is taken as the final semantic template (a minimal sketch of steps 2 to 4 follows).
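The sketch below illustrates steps 2 to 4 under stated assumptions: the seed templates, synonym table, answer space, and checkpoint path are invented for illustration; back-translation and similar-sentence generation are omitted (they would need an external paraphrase model); and because the patent does not spell out the scoring formula, the "average probability" is interpreted here as a masked-LM pseudo-log-likelihood averaged over tokens and over the answer space.

```python
import torch
from itertools import product
from transformers import BertForMaskedLM, BertTokenizerFast

# "medical-bert" stands in for the domain model from the language-model
# training step; templates, synonyms, and answers are illustrative assumptions.
tokenizer = BertTokenizerFast.from_pretrained("medical-bert")
model = BertForMaskedLM.from_pretrained("medical-bert").eval()

SEED_TEMPLATES = ["According to the above, the diagnosis is {}"]
SYNONYMS = {  # hypothetical synonym table for phrase-level replacement
    "According to": ["According to", "Based on", "Judging from"],
    "the diagnosis is": ["the diagnosis is", "the patient has"],
}
ANSWER_SPACE = ["kidney disease", "digestive disease", "cardiovascular disease"]

def expand(template):
    """Yield every variant produced by swapping synonymous phrases (step 2)."""
    keys = [k for k in SYNONYMS if k in template]
    for combo in product(*(SYNONYMS[k] for k in keys)):
        variant = template
        for key, replacement in zip(keys, combo):
            variant = variant.replace(key, replacement)
        yield variant

def pseudo_log_likelihood(sentence):
    """Average log-probability of each token when it is masked in turn."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):              # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total / max(len(ids) - 2, 1)

def template_score(template):
    """Average score of the template filled with every answer (step 4)."""
    filled = [template.format(answer) for answer in ANSWER_SPACE]
    return sum(pseudo_log_likelihood(s) for s in filled) / len(filled)

candidates = sorted({v for t in SEED_TEMPLATES for v in expand(t)})   # step 2
best_template = max(candidates, key=template_score)                   # step 4
print(best_template)
```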
Training classification tasks: the annotation data is converted into template form, so that classification becomes a language-model task. For example, one prescription reads: "after testing, the patient's blood pressure remains above 90/150 for a long period and can be identified as hypertension", and its label is "cardiovascular and cerebrovascular disease". After splicing this training text onto the template, the model input during training becomes "after testing, the patient's blood pressure remains above 90/150 for a long period and can be identified as hypertension [SEP] it can be determined according to the above that the patient has", and the prediction target of the model is "cardiovascular and cerebrovascular disease". The language model obtained in the language-model training step is then fine-tuned on the converted training data.
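A minimal sketch of this conversion and fine-tuning under a masked-language-model reading of the method: the checkpoint path (standing in for the domain model trained above), the template wording, the example text and label, and the single training step are all illustrative assumptions; a real run would loop over the entire converted dataset.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("medical-bert")   # assumed domain model
model = BertForMaskedLM.from_pretrained("medical-bert")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

TEMPLATE = "According to the above, the diagnosis is {}"        # assumed wording
text = ("After testing, the patient's blood pressure remains above 90/150 "
        "for a long period, which can be identified as hypertension")
label = "cardiovascular disease"

# Q1: splice the prescription text with the template; the label fills the answer slot.
n_label_tokens = len(tokenizer.tokenize(label))
masked_slot = " ".join([tokenizer.mask_token] * n_label_tokens)
prompt = text + " [SEP] " + TEMPLATE.format(masked_slot)
target = text + " [SEP] " + TEMPLATE.format(label)

inputs = tokenizer(prompt, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt")["input_ids"]
labels[inputs["input_ids"] != tokenizer.mask_token_id] = -100   # loss only on the answer

# Q2: one fine-tuning step with the masked-LM loss on the converted example.
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
print(float(loss))
```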
Result prediction and classification: at prediction time, the input text is spliced with the template to form the model input, and the model yields the classification result by predicting the class name. For example, the prescription content is "the patient's blood pressure has been high for a long time, and hypertension is considered." After splicing with the template, this becomes "the patient's blood pressure has been high for a long time, and hypertension is considered. [SEP] The patient is diagnosed as suffering from", which is fed into the model. The model predicts the probability of the continuation iteratively; within the answer space, the words "cardiovascular and cerebrovascular disease" receive the highest probability, so the predicted classification of this prescription is "cardiovascular and cerebrovascular disease".
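A minimal sketch of the prediction step under the same masked-language-model reading: the text under test is spliced with the selected template, every answer in the answer space is scored at the masked slot, and the highest-scoring answer is returned as the class. The checkpoint path, template wording, and answer space are again illustrative assumptions.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("medical-bert-finetuned")  # assumed fine-tuned model
model = BertForMaskedLM.from_pretrained("medical-bert-finetuned").eval()

TEMPLATE = "According to the above, the diagnosis is {}"                 # assumed wording
ANSWER_SPACE = ["cardiovascular disease", "kidney disease", "digestive disease"]
text = "The patient's blood pressure has been high for a long time; hypertension is considered."

def answer_score(answer: str) -> float:
    """Mean log-probability of the answer tokens at the masked answer slot."""
    answer_tokens = tokenizer.tokenize(answer)
    masked_slot = " ".join([tokenizer.mask_token] * len(answer_tokens))
    prompt = text + " [SEP] " + TEMPLATE.format(masked_slot)
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        log_probs = torch.log_softmax(model(**inputs).logits[0], dim=-1)
    answer_ids = tokenizer.convert_tokens_to_ids(answer_tokens)
    scores = [log_probs[pos, tok].item() for pos, tok in zip(mask_positions, answer_ids)]
    return sum(scores) / len(scores)

prediction = max(ANSWER_SPACE, key=answer_score)
print(prediction)  # the predicted class of the prescription text
```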
Beyond the example described above, the invention naturally admits other similar processes; in sum, it encompasses other variations and alternatives that will be apparent to those skilled in the art.

Claims (6)

1. A medical text classification method based on semantic templates and language models, characterized by comprising the following steps:
S1: prepare a corpus by collecting medical-related text from public sources;
S2: train a language model based on a self-attention neural network architecture on the corpus obtained in step S1;
S3: design a language template;
S4: train the classification task;
S5: input the text to be classified, predict the result, and confirm its category.
2. The medical text classification method based on semantic templates and language models according to claim 1, wherein in step S1 the public sources include Wikipedia, medical textbooks, drug manuals, papers, and periodicals.
3. The medical text classification method based on semantic templates and language models according to claim 1, wherein in step S3 the language template is designed through the following steps:
T1: design the template wording according to the task content and the common language habits of the medical field, and reserve an answer space, obtaining a seed template;
T2: generate synonymous paraphrases of the seed template through synonym replacement, back-translation, and similar-sentence generation, forming a candidate template library;
T3: combine the templates in the candidate template library with the answers in the answer space, feed the combinations into the language model of step S2 for scoring, and select the template with the highest average score.
4. The medical text classification method based on semantic templates and language models according to claim 1, wherein in step S4 the classification task is trained through the following steps:
Q1: convert the classification annotation data into template training data using the language template obtained in step S3;
Q2: feed the converted annotation data into the language model of step S2 for fine-tuning.
5. The method according to claim 4, wherein in step Q1 the classification annotation data is converted into template training data by splicing the condition text in the annotation data with the language template obtained in step S3, the spliced text serving as the input to the language model of step S2 during training and the label in the annotation data serving as the training output.
6. The medical text classification method based on semantic templates and language models according to claim 1, wherein in step S5 the text to be classified is spliced with the language template obtained in step S3, the spliced text is fed into the language model trained in step S4 for prediction, and the answer with the highest probability in the answer space is output as the prediction result, thereby confirming the specific category of the text.
CN202111412869.3A 2021-11-25 2021-11-25 Medical text classification method based on semantic template and language model Pending CN114357108A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111412869.3A CN114357108A (en) 2021-11-25 2021-11-25 Medical text classification method based on semantic template and language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111412869.3A CN114357108A (en) 2021-11-25 2021-11-25 Medical text classification method based on semantic template and language model

Publications (1)

Publication Number Publication Date
CN114357108A true CN114357108A (en) 2022-04-15

Family

ID=81096070

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111412869.3A Pending CN114357108A (en) 2021-11-25 2021-11-25 Medical text classification method based on semantic template and language model

Country Status (1)

Country Link
CN (1) CN114357108A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170177715A1 (en) * 2015-12-21 2017-06-22 Adobe Systems Incorporated Natural Language System Question Classifier, Semantic Representations, and Logical Form Templates
US20200118682A1 (en) * 2018-10-12 2020-04-16 Fujitsu Limited Medical diagnostic aid and method
CN111401066A (en) * 2020-03-12 2020-07-10 腾讯科技(深圳)有限公司 Artificial intelligence-based word classification model training method, word processing method and device
CN112686044A (en) * 2021-01-18 2021-04-20 华东理工大学 Medical entity zero sample classification method based on language model

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117828087A (en) * 2024-01-08 2024-04-05 北京三维天地科技股份有限公司 LLM-based medical instrument data classification method and system
CN117828087B (en) * 2024-01-08 2024-07-09 北京三维天地科技股份有限公司 LLM-based medical instrument data classification method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination