CN114357108A - Medical text classification method based on semantic template and language model - Google Patents

Medical text classification method based on semantic template and language model

Info

Publication number
CN114357108A
Authority
CN
China
Prior art keywords
language
template
training
text
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111412869.3A
Other languages
Chinese (zh)
Inventor
侯聪
唐文瀚
余海东
肖茂
许瑞玲
王俊
蔡冲
夏凯
陈佳林
白良俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Daguan Data Chengdu Co ltd
Original Assignee
Daguan Data Chengdu Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Daguan Data Chengdu Co ltd
Priority to CN202111412869.3A
Publication of CN114357108A
Legal status: Pending (current)

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a medical text classification method based on a semantic template and a language model, comprising the following steps: S1: prepare a corpus by collecting medical-related text from public sources; S2: train a language model based on a self-attention neural network architecture on the corpus obtained in step S1; S3: design a language template; S4: train the classification task; S5: input the text to be classified, predict the result, and confirm its category. In a medical auditing scenario, a pre-trained semantic model with strong generalization ability, built from a large amount of unstructured data, can be applied to settings with little or even no labeled data, reducing the need for annotation and avoiding overfitting.

Description

Medical text classification method based on semantic template and language model
Technical Field
The application relates to the field of natural language processing algorithms, in particular to a medical text classification method based on a semantic template and a language model.
Background
During auditing, medical texts such as prescriptions sometimes need to be classified, and given the volume of data, automatic classification by a model is the only practical choice. A medical text classification system is a text classification application in which a model automatically assigns the input text to a category.
Document classification in the medical field often suffers from a shortage of labeled data: in a given application scenario, some categories have only a few labeled documents or none at all. In open-source domains, however, a large amount of related text is available.
Common supervised classification models, such as fastText, TextCNN, and BERT-based text classifiers, all follow the same pattern: training data is collected according to a classification taxonomy, a model is trained, and the trained model then predicts document categories. This approach works when labeled data is plentiful; however, applying it directly to document classification in specialized fields where annotated data is scarce has the following disadvantages:
1. With little data, the model easily overfits, which reduces its generalization ability and degrades prediction quality;
2. When no labeled samples exist, the model cannot be trained at all and no prediction is possible.
Disclosure of Invention
To remedy these shortcomings of the prior art, the invention provides a medical text classification method based on a semantic template and a language model. In a medical auditing scenario, it applies a pre-trained semantic model with strong generalization ability, built from a large amount of unstructured data, to settings with little or even no labeled data, reducing the need for annotation and avoiding overfitting.
In order to achieve the above object, the present invention provides the following technical solution:
A medical text classification method based on semantic templates and language models comprises the following steps:
S1: prepare a corpus by collecting medical-related text from public sources;
S2: train a language model based on a self-attention neural network architecture on the corpus obtained in step S1;
S3: design a language template;
S4: train the classification task;
S5: input the text to be classified, predict the result, and confirm its category.
Further, in step S1, the public sources include Wikipedia, medical textbooks, drug manuals, papers, and periodicals.
Further, in step S3, the language template is designed through the following steps:
T1: design the template wording according to the task content and the common language habits of the medical field, and reserve an answer space, obtaining a seed template;
T2: generate synonymous paraphrases of the seed template through synonym replacement, back-translation, and similar-sentence generation, forming a candidate template library;
T3: combine the templates in the candidate template library with the answers in the answer space, feed the combinations into the language model of step S2 for scoring, and select the template with the highest average score.
Further, in step S4, the classification task is trained through the following steps:
Q1: convert the classification annotation data into template training data using the language template obtained in step S3;
Q2: feed the converted annotation data into the language model of step S2 for fine-tuning.
Further, in step Q1, the classification annotation data is converted into template training data by splicing the condition text in the annotation data with the language template obtained in step S3; during training, the spliced text serves as the input to the language model of step S2, and the label in the annotation data serves as the training output.
Further, in step S5, result prediction is performed by splicing the text to be classified with the language template obtained in step S3, feeding the spliced text into the language model fine-tuned in step S4, and selecting the answer with the highest probability in the answer space as the prediction output, thereby confirming the specific category of the text.
By virtue of the above technical solution, the invention has the following advantages:
1. The general semantic encoder is trained on a large amount of general-domain data and encodes semantics effectively; even without labeled data, direct prediction with a suitable template reaches a useful level of accuracy, so no training data is required.
2. When only a small amount of labeled data is available, the generalization ability of the pre-trained language model reduces the risk of overfitting.
Detailed Description
The invention is described in further detail below with reference to a specific example so that its construction and manner of use can be clearly understood; the scope of protection of this patent is not limited thereby.
A medical text classification method based on semantic templates and language models comprises the following steps:
S1: prepare a corpus by collecting medical-related text from public sources;
S2: train a language model based on a self-attention neural network architecture on the corpus obtained in step S1;
S3: design a language template;
S4: train the classification task;
S5: input the text to be classified, predict the result, and confirm its category.
Further, in step S1, the public sources include Wikipedia, medical textbooks, drug manuals, papers, and periodicals.
Further, in step S3, the language template is designed through the following steps:
T1: design the template wording according to the task content and the common language habits of the medical field, and reserve an answer space, obtaining a seed template;
T2: generate synonymous paraphrases of the seed template through synonym replacement, back-translation, and similar-sentence generation, forming a candidate template library;
T3: combine the templates in the candidate template library with the answers in the answer space, feed the combinations into the language model of step S2 for scoring, and select the template with the highest average score.
Further, in step S4, the classification task is trained through the following steps:
Q1: convert the classification annotation data into template training data using the language template obtained in step S3;
Q2: feed the converted annotation data into the language model of step S2 for fine-tuning.
Further, in step Q1, the classification annotation data is converted into template training data by splicing the condition text in the annotation data with the language template obtained in step S3; during training, the spliced text serves as the input to the language model of step S2, and the label in the annotation data serves as the training output.
Further, in step S5, result prediction is performed by splicing the text to be classified with the language template obtained in step S3, feeding the spliced text into the language model fine-tuned in step S4, and selecting the answer with the highest probability in the answer space as the prediction output, thereby confirming the specific category of the text.
Example one
In auditing, a large number of prescription texts must be classified: given the prescription content, determine which disease category the diagnosis mainly belongs to. The system holds a small amount of labeled data, but its distribution is uneven, and many less common diseases have no annotated data at all.
Preparing a corpus: a large amount of medical-domain text is crawled from the web, including Wikipedia, medical textbooks, drug manuals, and so on.
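The patent does not prescribe how the crawling is done; the following Python sketch shows one common approach using requests and BeautifulSoup, where the URL list, the output file name, and the length filter are placeholder assumptions.

```python
# Minimal corpus-collection sketch (assumptions: the URLs are placeholders,
# and real crawling must respect each site's terms of use and robots.txt).
import requests
from bs4 import BeautifulSoup

SOURCE_URLS = [
    "https://example.org/medical-article-1",   # placeholder URLs only
    "https://example.org/drug-manual-entry",
]

with open("medical_corpus.txt", "w", encoding="utf-8") as corpus:
    for url in SOURCE_URLS:
        html = requests.get(url, timeout=10).text
        # Strip markup and keep only reasonably long text lines.
        text = BeautifulSoup(html, "html.parser").get_text(separator="\n")
        for line in text.splitlines():
            line = line.strip()
            if len(line) > 20:
                corpus.write(line + "\n")
```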
Training a language model: a BERT-based language model is trained on this corpus.
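A minimal sketch of this step, assuming a HuggingFace Transformers masked-language-model setup: initializing from a public checkpoint rather than training from scratch, the corpus file name (matching the crawling sketch above), and the hyperparameters are illustrative assumptions, not details given in the patent.

```python
# Domain-adaptive masked-language-model training sketch (all names and
# hyperparameters below are illustrative assumptions).
from datasets import load_dataset
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")

# medical_corpus.txt: one sentence or paragraph of crawled medical text per line.
corpus = load_dataset("text", data_files={"train": "medical_corpus.txt"})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens; the model learns to recover them.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="medical-bert", num_train_epochs=1,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()
```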
Template selection: 1. Drawing on general domain knowledge, a set of seed templates is written, such as "diagnosis xxx according to the above", "diagnosis based on the above", "patient has xxx", and the like, where xxx marks the reserved answer slot.
2. Synonymous paraphrases of the seed templates are generated through synonym replacement, back-translation, and similar-sentence generation, yielding a batch of candidate templates.
3. Based on professional knowledge and project requirements, the answer candidate space is defined, namely disease categories such as kidney disease, digestive disease, and so on.
4. Each candidate template is combined with the answers, fed into the domain language model obtained in the language-model training step, and the average probability is computed; the template with the highest score is taken as the final semantic template (a minimal sketch of steps 2 to 4 follows).
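The sketch below illustrates steps 2 to 4 under stated assumptions: the seed templates, synonym table, answer space, and checkpoint path are invented for illustration; back-translation and similar-sentence generation are omitted (they would need an external paraphrase model); and because the patent does not spell out the scoring formula, the "average probability" is interpreted here as a masked-LM pseudo-log-likelihood averaged over tokens and over the answer space.

```python
import torch
from itertools import product
from transformers import BertForMaskedLM, BertTokenizerFast

# "medical-bert" stands in for the domain model from the language-model
# training step; templates, synonyms, and answers are illustrative assumptions.
tokenizer = BertTokenizerFast.from_pretrained("medical-bert")
model = BertForMaskedLM.from_pretrained("medical-bert").eval()

SEED_TEMPLATES = ["According to the above, the diagnosis is {}"]
SYNONYMS = {  # hypothetical synonym table for phrase-level replacement
    "According to": ["According to", "Based on", "Judging from"],
    "the diagnosis is": ["the diagnosis is", "the patient has"],
}
ANSWER_SPACE = ["kidney disease", "digestive disease", "cardiovascular disease"]

def expand(template):
    """Yield every variant produced by swapping synonymous phrases (step 2)."""
    keys = [k for k in SYNONYMS if k in template]
    for combo in product(*(SYNONYMS[k] for k in keys)):
        variant = template
        for key, replacement in zip(keys, combo):
            variant = variant.replace(key, replacement)
        yield variant

def pseudo_log_likelihood(sentence):
    """Average log-probability of each token when it is masked in turn."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):              # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total / max(len(ids) - 2, 1)

def template_score(template):
    """Average score of the template filled with every answer (step 4)."""
    filled = [template.format(answer) for answer in ANSWER_SPACE]
    return sum(pseudo_log_likelihood(s) for s in filled) / len(filled)

candidates = sorted({v for t in SEED_TEMPLATES for v in expand(t)})   # step 2
best_template = max(candidates, key=template_score)                   # step 4
print(best_template)
```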
Training classification tasks: the annotation data is converted into template form, so that classification becomes a language-model task. For example, one prescription reads: "after testing, the patient's blood pressure remains above 90/150 for a long period and can be identified as hypertension", and its label is "cardiovascular and cerebrovascular disease". After splicing this training text onto the template, the model input during training becomes "after testing, the patient's blood pressure remains above 90/150 for a long period and can be identified as hypertension [SEP] it can be determined according to the above that the patient has", and the prediction target of the model is "cardiovascular and cerebrovascular disease". The language model obtained in the language-model training step is then fine-tuned on the converted training data.
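A minimal sketch of this conversion and fine-tuning under a masked-language-model reading of the method: the checkpoint path (standing in for the domain model trained above), the template wording, the example text and label, and the single training step are all illustrative assumptions; a real run would loop over the entire converted dataset.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("medical-bert")   # assumed domain model
model = BertForMaskedLM.from_pretrained("medical-bert")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

TEMPLATE = "According to the above, the diagnosis is {}"        # assumed wording
text = ("After testing, the patient's blood pressure remains above 90/150 "
        "for a long period, which can be identified as hypertension")
label = "cardiovascular disease"

# Q1: splice the prescription text with the template; the label fills the answer slot.
n_label_tokens = len(tokenizer.tokenize(label))
masked_slot = " ".join([tokenizer.mask_token] * n_label_tokens)
prompt = text + " [SEP] " + TEMPLATE.format(masked_slot)
target = text + " [SEP] " + TEMPLATE.format(label)

inputs = tokenizer(prompt, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt")["input_ids"]
labels[inputs["input_ids"] != tokenizer.mask_token_id] = -100   # loss only on the answer

# Q2: one fine-tuning step with the masked-LM loss on the converted example.
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
print(float(loss))
```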
Result prediction and classification: at prediction time, the input text is spliced with the template to form the model input, and the model yields the classification result by predicting the class name. For example, the prescription content is "the patient's blood pressure has been high for a long time, and hypertension is considered." After splicing with the template, this becomes "the patient's blood pressure has been high for a long time, and hypertension is considered. [SEP] The patient is diagnosed as suffering from", which is fed into the model. The model predicts the probability of the continuation iteratively; within the answer space, the words "cardiovascular and cerebrovascular disease" receive the highest probability, so the predicted classification of this prescription is "cardiovascular and cerebrovascular disease".
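A minimal sketch of the prediction step under the same masked-language-model reading: the text under test is spliced with the selected template, every answer in the answer space is scored at the masked slot, and the highest-scoring answer is returned as the class. The checkpoint path, template wording, and answer space are again illustrative assumptions.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("medical-bert-finetuned")  # assumed fine-tuned model
model = BertForMaskedLM.from_pretrained("medical-bert-finetuned").eval()

TEMPLATE = "According to the above, the diagnosis is {}"                 # assumed wording
ANSWER_SPACE = ["cardiovascular disease", "kidney disease", "digestive disease"]
text = "The patient's blood pressure has been high for a long time; hypertension is considered."

def answer_score(answer: str) -> float:
    """Mean log-probability of the answer tokens at the masked answer slot."""
    answer_tokens = tokenizer.tokenize(answer)
    masked_slot = " ".join([tokenizer.mask_token] * len(answer_tokens))
    prompt = text + " [SEP] " + TEMPLATE.format(masked_slot)
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        log_probs = torch.log_softmax(model(**inputs).logits[0], dim=-1)
    answer_ids = tokenizer.convert_tokens_to_ids(answer_tokens)
    scores = [log_probs[pos, tok].item() for pos, tok in zip(mask_positions, answer_ids)]
    return sum(scores) / len(scores)

prediction = max(ANSWER_SPACE, key=answer_score)
print(prediction)  # the predicted class of the prescription text
```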
Beyond the example described above, the invention naturally admits other similar processes; in sum, it encompasses other variations and alternatives that will be apparent to those skilled in the art.

Claims (6)

1. A medical text classification method based on semantic templates and language models, characterized by comprising the following steps:
S1: prepare a corpus by collecting medical-related text from public sources;
S2: train a language model based on a self-attention neural network architecture on the corpus obtained in step S1;
S3: design a language template;
S4: train the classification task;
S5: input the text to be classified, predict the result, and confirm its category.
2. The medical text classification method based on semantic templates and language models according to claim 1, wherein in step S1 the public sources include Wikipedia, medical textbooks, drug manuals, papers, and periodicals.
3. The medical text classification method based on semantic templates and language models according to claim 1, wherein in step S3 the language template is designed through the following steps:
T1: design the template wording according to the task content and the common language habits of the medical field, and reserve an answer space, obtaining a seed template;
T2: generate synonymous paraphrases of the seed template through synonym replacement, back-translation, and similar-sentence generation, forming a candidate template library;
T3: combine the templates in the candidate template library with the answers in the answer space, feed the combinations into the language model of step S2 for scoring, and select the template with the highest average score.
4. The medical text classification method based on semantic templates and language models according to claim 1, wherein in step S4 the classification task is trained through the following steps:
Q1: convert the classification annotation data into template training data using the language template obtained in step S3;
Q2: feed the converted annotation data into the language model of step S2 for fine-tuning.
5. The method according to claim 4, wherein in step Q1 the classification annotation data is converted into template training data by splicing the condition text in the annotation data with the language template obtained in step S3, the spliced text serving as the input to the language model of step S2 during training and the label in the annotation data serving as the training output.
6. The medical text classification method based on semantic templates and language models according to claim 1, wherein in step S5 the text to be classified is spliced with the language template obtained in step S3, the spliced text is fed into the language model trained in step S4 for prediction, and the answer with the highest probability in the answer space is output as the prediction result, thereby confirming the specific category of the text.
CN202111412869.3A 2021-11-25 2021-11-25 Medical text classification method based on semantic template and language model Pending CN114357108A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111412869.3A CN114357108A (en) 2021-11-25 2021-11-25 Medical text classification method based on semantic template and language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111412869.3A CN114357108A (en) 2021-11-25 2021-11-25 Medical text classification method based on semantic template and language model

Publications (1)

Publication Number Publication Date
CN114357108A true CN114357108A (en) 2022-04-15

Family

ID=81096070

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111412869.3A Pending CN114357108A (en) 2021-11-25 2021-11-25 Medical text classification method based on semantic template and language model

Country Status (1)

Country Link
CN (1) CN114357108A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170177715A1 (en) * 2015-12-21 2017-06-22 Adobe Systems Incorporated Natural Language System Question Classifier, Semantic Representations, and Logical Form Templates
US20200118682A1 (en) * 2018-10-12 2020-04-16 Fujitsu Limited Medical diagnostic aid and method
CN111401066A (en) * 2020-03-12 2020-07-10 腾讯科技(深圳)有限公司 Artificial intelligence-based word classification model training method, word processing method and device
CN112686044A (en) * 2021-01-18 2021-04-20 华东理工大学 Medical entity zero sample classification method based on language model

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117828087A (en) * 2024-01-08 2024-04-05 北京三维天地科技股份有限公司 LLM-based medical instrument data classification method and system
CN117828087B (en) * 2024-01-08 2024-07-09 北京三维天地科技股份有限公司 LLM-based medical instrument data classification method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination