CN117033627A - Medical text classification method and device based on prompt learning - Google Patents
- Publication number
- CN117033627A (application CN202310817238.2A)
- Authority
- CN
- China
- Prior art keywords
- text
- classification
- prompt
- medical
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F16/353: Information retrieval of unstructured textual data; clustering, classification into predefined classes
- G06F16/367: Creation of semantic tools; ontology
- G06F18/22: Pattern recognition; matching criteria, e.g. proximity measures
- G06F18/2431: Classification techniques; multiple classes
- G06F18/24317: Piecewise classification, i.e. whereby each classification requires several discriminant rules
- G06N3/0455: Neural networks; auto-encoder networks, encoder-decoder networks
- G06N5/022: Knowledge representation; knowledge engineering, knowledge acquisition
- G16H70/00: ICT specially adapted for the handling or processing of medical references
Abstract
The invention provides a medical text classification method and device based on prompt learning. The method comprises the following steps: acquiring prompt information for primary classification from an original medical text based on event prior information and knowledge prior information, wherein the primary classification comprises department categories; filtering the original medical text, integrating the filtered text with the prompt information, and inputting the integrated result into a large language generation model; and calculating the similarity between the result sequence output by the large language generation model and each secondary classification label representing a disease category under the primary classification, and taking the label category with the maximum similarity as the secondary category output by the model. The invention realizes primary classification by department category and, under each primary class, secondary classification by disease category, which better matches common practice in the medical field and yields more standardized classification results; moreover, because the classification labels need not be fixed, multi-level open-domain text classification can be realized effectively.
Description
Technical Field
The invention belongs to the technical field of text processing, and particularly relates to a medical text classification method and device based on prompt learning.
Background
Classifying medical texts, with their specialized information and domain knowledge, has always been challenging: narrative devices such as technical terms, clinical experience, and contrastive turns can create ambiguity and conflicting category cues between clauses, making whole-document classification difficult for a model. Research institutions at home and abroad have long proposed various deep learning models for text classification, but small models cannot provide sufficiently accurate results owing to limits on model capacity and hardware speed, while large models are difficult to deploy. In recent years, with the rapid development of large language models, new natural language processing paradigms such as knowledge transfer, fine-tuning, and prompt learning have been proposed, allowing small models to deliver convenient and fast text classification products and services by drawing on the massive parameters of large language models.
Existing medical text classification methods based on templated prompt learning generally concatenate the initial input text with prompt content, feed the result into a template-generating encoder, and update the prompt content inside the encoder together with the input text. The updated template parameters are then extracted, concatenated with the original (un-updated) input text, and fed into a large pre-trained language model, whose massive parameters and training examples improve classification accuracy and related metrics. However, because the encoder that updates the template parameters is based on a multi-head self-attention network trained on a massive general-purpose text corpus, it works well for general text processing but lacks medical domain knowledge; its gaps and biases in domain understanding limit its grasp of the context and semantics of medical text. In addition, such methods ignore important fact-related medical information (such as the medical entities present, key index values, and relationships between entities), and this missing information hurts classification of text in specialized contexts. Models such as ChatGPT can generate classification labels, enabling open-domain classification that produces labels beyond a fixed set; however, lacking medical knowledge enhancement and standardization, their output categories are generally uncontrollable, usually yield only a primary classification label, and are not accurate enough for finer-grained judgments about a specific disease.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a medical text classification method and device based on prompt learning.
In order to achieve the above object, the present invention adopts the following technical scheme.
In a first aspect, the present invention provides a medical text classification method based on prompt learning, comprising the steps of:
acquiring prompt information for primary classification from an original medical text based on event priori information and knowledge priori information, wherein the primary classification comprises department categories;
filtering the original medical text, integrating the filtered text with the prompt information, and inputting the integrated text into a large language generation model;
and calculating the similarity between the result sequence output by the large language generation model and each secondary classification label representing the disease category under the primary classification, and taking the label category corresponding to the maximum value of the similarity as the secondary category output by the large language generation model.
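The three steps above can be sketched as a minimal pipeline. This is an illustrative sketch only: the `get_prompt`, `filter_text`, and `generate` callables are hypothetical placeholders standing in for the extraction models, text filter, and large language generation model described in the claims.

```python
import math

def cosine(u, v):
    # cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def classify(text, label_embeddings, get_prompt, filter_text, generate):
    # Step 1: prompt information from event/knowledge priors;
    # Step 2: filter the text and integrate (concatenate) it with the prompt;
    # Step 3: return the secondary label most similar to the generated sequence.
    prompt = get_prompt(text)
    integrated = prompt + filter_text(text)   # integration = concatenation
    result_vec = generate(integrated)          # output sequence as a vector
    return max(label_embeddings,
               key=lambda lbl: cosine(result_vec, label_embeddings[lbl]))
```

With toy one-hot label embeddings, the label whose vector best matches the generated sequence wins the argmax.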
Further, the method for acquiring the prompt information comprises the following steps:
inputting an original medical text into a medical event feature extraction model to obtain structured event prompt information A;
inputting the original medical text into a medical knowledge feature extraction model to obtain structured knowledge prompt information B;
and integrating A and B and feeding the result into a prompt template generation module to obtain the prompt information for primary classification.
Still further, the medical event feature extraction model is a pre-trained RoBERTa model, fine-tuned on a dataset composed of publicly available medical information.
Further, the medical knowledge feature extraction model is a pre-trained BioBERT model, fine-tuned on a dataset composed of publicly available medical knowledge graph data.
Further, the prompt template generation module is formed of 4 identical stacked layers, each layer being a Transformer decoder layer with the second attention (cross-attention) sub-layer removed.
Further, the original medical text is filtered by passing it through a text filter; the text filter is a classifier based on a ConvNeXt Tiny network, trained on medical texts manually annotated at sentence level, with each sentence labeled as valid or invalid.
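The sentence-level filtering step can be sketched as follows. The `is_valid_sentence` rule below is a hypothetical keyword stand-in for the trained ConvNeXt Tiny classifier, used only to make the filtering flow concrete:

```python
import re

# Hypothetical stand-in for the ConvNeXt Tiny sentence classifier: a simple
# rule marks sentences containing watermark/address/link debris as invalid.
INVALID_MARKERS = ("watermark", "http", "address")

def is_valid_sentence(sentence: str) -> bool:
    return not any(m in sentence.lower() for m in INVALID_MARKERS)

def filter_text(text: str) -> str:
    # sentence-level filtering: keep only sentences classified as valid
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(s for s in sentences if s and is_valid_sentence(s))
```

In the patent's design the per-sentence valid/invalid decision would come from the trained classifier rather than keyword rules.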
Further, before the prompt information is integrated with the filtered text, a space mapping module maps the feature space of the prompt information into the input feature space understood by the large language generation model.
Further, the primary classification includes at least the cardiovascular, orthopedics and neurology departments; the secondary classification under cardiovascular includes at least coronary heart disease, ischemic heart disease, stroke, aortic aneurysm and renal artery stenosis; the secondary classification under orthopedics includes at least patellar fracture, ulnar nerve injury, congenital coxa vara and infectious costochondritis; the secondary classification under neurology includes at least cerebral hemorrhage, cerebral embolism, headache and myelitis.
Further, the method for calculating the similarity comprises the following steps:
embedding the labels to be classified with a pre-trained Word2Vec word vector model, and recording the label representation as (x_1, x_2, …, x_n);
recording the result sequence output by the large language generation model as (y_1, y_2, …, y_n);
calculating the cosine similarity of (x_1, x_2, …, x_n) and (y_1, y_2, …, y_n) as:
K_cos = (Σ_{i=1}^{n} x_i·y_i) / (√(Σ_{i=1}^{n} x_i²) · √(Σ_{i=1}^{n} y_i²))
wherein K_cos is the cosine similarity.
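The cosine similarity computation can be checked numerically with a few lines of plain Python; the vectors here are made-up toy values, not real Word2Vec embeddings:

```python
import math

def k_cos(x, y):
    # K_cos = sum(x_i * y_i) / (sqrt(sum(x_i^2)) * sqrt(sum(y_i^2)))
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return num / den

print(k_cos([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0 up to float rounding
print(k_cos([1.0, 0.0], [0.0, 1.0]))            # 0.0 (orthogonal vectors)
```

Identical vectors score 1, orthogonal vectors score 0, so taking the label with the maximum K_cos picks the closest secondary class.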
In a second aspect, the present invention provides a medical text classification apparatus based on prompt learning, comprising:
the prompt information acquisition module is used for acquiring prompt information for primary classification from the original medical text based on event priori information and knowledge priori information, wherein the primary classification comprises department categories;
the information integration module is used for filtering the original medical text, integrating the filtered text with the prompt information and inputting the integrated text into the large language generation model;
and the secondary classification module is used for calculating the similarity between the result sequence output by the large language generation model and each secondary classification label representing the disease category under the primary classification, and taking the label category corresponding to the maximum value of the similarity as the secondary category output by the large language generation model.
Compared with the prior art, the invention has the following beneficial effects.
According to the invention, prompt information for primary classification (including department categories) is obtained from the original medical text based on event prior information and knowledge prior information; the original medical text is filtered, and the filtered text is integrated with the prompt information and input into a large language generation model; the similarity between the result sequence output by the model and each secondary classification label representing a disease category under the primary classification is calculated, and the label category with the maximum similarity is taken as the secondary category output by the model, thereby realizing prompt-learning-based medical text classification. The invention realizes not only primary classification by department category but also secondary classification by disease category under each primary class, which better matches common practice in the medical field and yields more standardized classification results; moreover, because the classification labels need not be fixed, multi-level open-domain text classification can be realized effectively.
Drawings
Fig. 1 is a flowchart of a medical text classification method based on prompt learning according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of two stages of another embodiment of the present invention.
Fig. 3 is a schematic structural diagram of the template generation module.
Fig. 4 is a block diagram of a medical text classification apparatus based on prompt learning according to an embodiment of the present invention.
Detailed Description
The present invention will be further described with reference to the drawings and the detailed description below, in order to make the objects, technical solutions and advantages of the present invention more apparent. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 1 is a flowchart of a medical text classification method based on prompt learning according to an embodiment of the invention, including the following steps:
step 101, acquiring prompt information for primary classification from an original medical text based on event priori information and knowledge priori information, wherein the primary classification comprises department categories;
step 102, filtering an original medical text, integrating the filtered text with the prompt information, and inputting the integrated text into a large language generation model;
and 103, calculating the similarity between a result sequence output by the large language generation model and each secondary classification label representing the disease category under the primary classification, and taking the label category corresponding to the maximum value of the similarity as the secondary category output by the large language generation model.
In this embodiment, step 101 is mainly used to obtain the prompt information for primary classification. Prompt learning uses a template to elicit factual knowledge from a pre-trained language model and then maps the obtained knowledge, as an answer space, onto a target space to complete the desired downstream task. Prompt learning can effectively help fine-tune a large language model for downstream tasks, achieving efficient parameter transfer learning with very little fine-tuning compute. Existing prompt-learning-based text classification techniques have two problems. One is prompt learning with hard templates (discrete prompt templates, i.e. actual text strings): hard templates are highly subjective to construct and limited in what they can express, so the model cannot fully exploit the information in the text, lowering prediction accuracy. The other is introducing soft prompt templates (continuous prompt templates, described directly in the embedding space of the underlying language model, usually as vector representations): these lack intuitive interpretability and prior knowledge, rely only on downstream task data for tuning, rarely achieve better performance, and are relatively expensive to train. For this reason, this embodiment derives a soft prompt template from knowledge extraction information and event extraction results. Such a template effectively captures the main medical events in the text and the related medical knowledge, better optimizing the classification effect of the large language generation model. Meanwhile, the text can be given a primary classification according to the soft prompt information, preliminarily obtaining its primary class label.
The first class classification of this embodiment mainly includes department categories, such as cardiovascular department, orthopedics, and the like.
In this embodiment, step 102 is mainly used to integrate the original medical text with the prompt information. Original medical text is generally unstructured free-text description and usually contains a large amount of invalid information and fields, such as organization names and addresses irrelevant to the content, text watermarks, and data interfaces or special fields in HTML files; as a result, the density of useful information is low, and directly extracting all text features would introduce a large amount of noise and dilute the core content. For this reason, the original medical text is filtered first, and the filtered text, with invalid sentences removed, is integrated with the prompt information. Integration means concatenating the two feature vectors.
For ease of understanding, a piece of original medical text is extracted below:
contrast 2016-11-29CT after right lung cancer chemotherapy: the upper right lung lesions were slightly smaller than the anterior extent and the peripheral small inflammation was slightly smaller than the anterior extent. Both lungs are scattered in the nodules, approximately the same as before. The subclavian area, the mediastinal multiple lymph nodes on the left side, is slightly reduced in consideration of metastasis. A hepatic cyst. Small cyst of left kidney. Contrast 2016-11-29CT after right lung cancer chemotherapy: the upper right lung has irregular nodular shape and flaky focus, the boundary is not clear, the maximum layer size is about 12mm multiplied by 8mm, the edge is in a foliated shape, the uneven reinforcement of scanning is enhanced, the oblique pleura is tightly attached, the partial range is slightly reduced than the anterior range, the upper right lung has a little plaque flaky slightly higher density shadow, and the boundary is not clear, and is obviously reduced than the anterior range. The right kidney and the two kidney glands were not abnormal. There were no enlarged lymph nodes behind the diaphragm feet and beside the abdominal aorta. No signs of bone destruction were seen in the scan range.
In this embodiment, step 103 is mainly used to output the secondary class. The integrated information obtained in step 102 is input into the large language generation model, the similarity between the result sequence output by the model and each secondary classification label is calculated, and the label category with the maximum similarity is output as the secondary class. A large language generation model is a language model with a very large number of parameters (hundreds of billions or more) trained on massive text data, such as GPT-3. GPT-3 is built by stacking multiple Transformer decoder layers (with the second attention sub-layer removed). GPT-3 is very powerful at text generation and achieves strikingly good results on complex NLP tasks, advancing solutions in fields such as creative fiction, stories, resumes, narration, chatbots, and text summarization. These tasks can be completed without supervised fine-tuning of the model; for a new task, GPT-3 needs very little data to understand the task requirements and approach or surpass the current state of the art. The secondary classification in this embodiment is a disease category under the primary classification (department category), such as coronary heart disease or ischemic heart disease. To realize open-domain classification, the classification labels need not be fixed: when labels are acquired, they can be embedded, the secondary classification labels are represented as vectors, and similarity is computed against the representation of the corresponding primary classification label combined with the secondary classification label.
Therefore, this embodiment realizes not only primary classification but also secondary open-domain classification under the primary classification.
As an optional embodiment, the method for obtaining the prompt information includes:
inputting an original medical text into a medical event feature extraction model to obtain structured event prompt information A;
inputting the original medical text into a medical knowledge feature extraction model to obtain structured knowledge prompt information B;
and integrating A, B and outputting to a prompt template generation module to obtain prompt information for primary classification.
This embodiment provides a technical scheme for acquiring the prompt information. As shown in fig. 2, the original medical text is first input into the medical event feature extraction model and the medical knowledge feature extraction model, respectively, to extract event prompt information and knowledge prompt information. For example, after inputting the sentence "boundary is unclear, the maximum layer size is about 12mm×8mm" into the medical event feature extraction model, it outputs "primary lesion size: 12mm×8mm"; and after inputting the triplet <malignant lung tumor, disease-corresponding drug, gefitinib> into the medical knowledge feature extraction model, it outputs "oncology". The extracted event prompt information and knowledge prompt information are then integrated (concatenated) and sent to the template generation module to obtain the prompt information for primary classification.
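The integration step described above is concatenation of the two structured prompt feature vectors (A then B). A minimal sketch, with made-up toy feature values:

```python
def integrate(event_prompt, knowledge_prompt):
    # "Integration" here is splicing the two feature vectors together:
    # structured event prompt A followed by structured knowledge prompt B.
    return list(event_prompt) + list(knowledge_prompt)

a = [0.2, 0.5]        # toy features for event prompt information A
b = [0.1, 0.9, 0.4]   # toy features for knowledge prompt information B
combined = integrate(a, b)  # fed onward to the template generation module
```

The combined vector is what the prompt template generation module consumes to produce the soft prompt for primary classification.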
As an alternative embodiment, the medical event feature extraction model is a pre-trained RoBERTa model, and the fine-tuning training is performed by using a data set composed of the disclosed medical information.
This embodiment provides the network structure and training method of the medical event feature extraction model, which is a pre-trained RoBERTa model. BERT is a classic deep learning network, and BERT-based models perform well in natural language processing, with particularly notable results in text representation and feature encoding. RoBERTa is an enhanced, more carefully tuned version of BERT: it uses more parameters and more training data and introduces dynamic masking and a larger vocabulary during training. A large body of literature and experiments shows that RoBERTa performs excellently on information extraction tasks, so it is selected as the medical event feature extraction model in this embodiment. A training dataset is constructed from publicly available medical information and used to fine-tune the pre-trained RoBERTa model; after training, the model can extract structured event information from unstructured text.
As an alternative embodiment, the medical knowledge feature extraction model is a pre-trained BioBERT model, fine-tuned on a data set composed of publicly disclosed medical knowledge-graph data.
The embodiment provides a network structure and a training method of the medical knowledge feature extraction model. The medical knowledge feature extraction model of this embodiment is a pre-trained BioBERT model, which is likewise a fine-tuned variant of the BERT model. BioBERT is a pre-trained biomedical language representation model for biomedical text mining; through the injection of additional medical knowledge during pre-training, it encodes a large amount of medical text information and excels at biomedical text mining tasks, so BioBERT is selected as the medical knowledge feature extraction model in this embodiment. The embodiment fine-tunes the pre-trained BioBERT model on publicly available large-scale medical knowledge-graph data using a department multi-label classification task; after training is completed, the model can extract structured knowledge information from unstructured text.
As an alternative embodiment, the prompt template generation module is formed of 4 identical layers, each layer employing a Transformer decoder with the second attention sublayer (the encoder-decoder cross-attention) removed.
The embodiment provides the network structure of the template generation module. The template generation module of this embodiment is formed by stacking 4 identical layers, each layer employing a Transformer decoder with the second attention sublayer (the encoder-decoder cross-attention) removed. The specific structure is shown in fig. 3.
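The layer structure described above can be sketched in PyTorch. The hidden size, head count, and feed-forward width below are illustrative assumptions (the patent does not specify them); only the structural point — 4 identical layers, each a decoder layer whose cross-attention sublayer is removed — follows the embodiment:

```python
import torch
import torch.nn as nn

class DecoderLayerNoCross(nn.Module):
    """Masked self-attention + feed-forward: a standard Transformer decoder
    layer with the second (encoder-decoder cross-) attention sublayer removed,
    since the template generation module takes no encoder input."""
    def __init__(self, d_model: int = 256, n_heads: int = 4, d_ff: int = 1024):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)   # residual + norm around self-attention
        x = self.norm2(x + self.ff(x)) # residual + norm around feed-forward
        return x

class TemplateGenerator(nn.Module):
    """Stack of 4 identical layers, per the embodiment."""
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.layers = nn.ModuleList(DecoderLayerNoCross(d_model) for _ in range(4))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x
```

A forward pass preserves the (batch, sequence, feature) shape, so the module can be dropped between the spliced prompt features and the downstream components.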
As an alternative embodiment, the original medical text is filtered by inputting it into a text filter; the text filter is a classifier adopting a ConvNeXt Tiny network, and is trained by sentence-level manual annotation of medical texts in which each sentence is labeled as a valid sentence or an invalid sentence.
The embodiment provides a technical scheme for filtering the original medical text. The original medical text is filtered by a text filter, a classifier that outputs two classes: valid sentences and invalid sentences. In this embodiment, sentence-level manual annotation is performed on medical texts, each sentence is labeled as valid or invalid, a ConvNeXt Tiny network is trained on this classification task, and the trained model serves as the text filter. A passage of text is judged sentence by sentence by the text filter; valid sentences are retained and invalid sentences are removed. For example, the following piece of original medical text is input into the text filter:
Anterior basal segment nodule of right lower lung: peripheral lung cancer is highly likely. Proximal nodules in the left lower anterior-internal basal segment: metastasis? Follow-up after treatment is recommended. Mediastinal lymph node enlargement. Bronchitis. Multiple cysts of double kidneys. Auditing physician: Liu Haipeng; the reporting physician: Liu Haipeng; the issuing physician: Chou. Reporting date: 14.6.2019.11:22:03. This report is for clinical reference only, and the unpublished physician (electronic) signature is considered invalid. XIMC-06-ZY-YL-0142018-A1.
The text filter outputs the following text:
Anterior basal segment nodule of right lower lung: peripheral lung cancer is highly likely. Proximal nodules in the left lower anterior-internal basal segment: metastasis? Follow-up after treatment is recommended. Mediastinal lymph node enlargement. Bronchitis. Multiple cysts of double kidneys. Reporting date: 14.6.2019.11:22:03.
It can be seen that the text filter deleted the invalid sentences "Auditing physician: Liu Haipeng; the reporting physician: Liu Haipeng; the issuing physician: Chou" and "This report is for clinical reference only, and the unpublished physician (electronic) signature is considered invalid. XIMC-06-ZY-YL-0142018-A1".
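The sentence-by-sentence filtering described above can be sketched as follows. Here `is_valid` is a trivial keyword rule standing in for the trained ConvNeXt Tiny classifier, and the marker strings are illustrative assumptions:

```python
import re

def is_valid(sentence: str) -> bool:
    # Placeholder for the trained ConvNeXt Tiny sentence classifier:
    # flag administrative boilerplate as invalid, everything else as valid.
    boilerplate = ("physician", "signature", "clinical reference")
    return not any(marker in sentence.lower() for marker in boilerplate)

def filter_text(text: str) -> str:
    # Split the passage into sentences, judge each one, and keep only
    # the sentences the classifier marks as valid.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return " ".join(s for s in sentences if is_valid(s))

report = ("Mediastinal lymph node enlargement. "
          "This report is for clinical reference only.")
print(filter_text(report))  # -> Mediastinal lymph node enlargement.
```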
As an alternative embodiment, before the prompt information is integrated with the filtered text, a space mapping module maps the feature space of the prompt information to the input feature space understandable by the large language generation model.
In this embodiment, a space mapping module is further provided before the large language generation model, and is used to map the feature space of the prompt information to the input feature space understandable by the large language generation model. The space mapping module comprises a Transformer encoding layer and a fully connected layer.
As an alternative embodiment, the primary classification includes at least the cardiovascular, orthopedics and neurology departments; the secondary classification under cardiovascular includes at least coronary heart disease, ischemic heart disease, stroke, aortic aneurysm and renal artery stenosis; the secondary classification under orthopedics includes at least patellar fracture, ulnar nerve injury, congenital hip varus and infectious costal chondritis; the secondary classification under neurology includes at least cerebral hemorrhage, cerebral embolism, headache, and myelitis.
The present embodiment provides several specific primary and secondary classification categories. It should be noted that this embodiment is only a preferred embodiment, and does not exclude or negate other possible embodiments, such as different departments and different disease categories.
As an optional embodiment, the method for calculating the similarity includes:
embedding the labels to be classified into representations by using a pre-trained word vector model Word2Vec, recorded as (x_1, x_2, …, x_n);
the result sequence output by the large language generation model is recorded as (y_1, y_2, …, y_n);
the cosine similarity of (x_1, x_2, …, x_n) and (y_1, y_2, …, y_n) is calculated as:
K_cos = (Σ_{i=1}^{n} x_i·y_i) / (√(Σ_{i=1}^{n} x_i²) · √(Σ_{i=1}^{n} y_i²))
wherein K_cos is the cosine similarity.
The embodiment provides a technical scheme for calculating the similarity between the result sequence output by the large language generation model and the secondary classification labels. In this embodiment, Word2Vec is used to embed the labels to be classified, and the cosine similarity between the result sequence and each embedded representation is then calculated using the formula above.
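The cosine-similarity calculation can be sketched in a few lines; the toy 2-dimensional vectors below stand in for the Word2Vec embeddings:

```python
import math

def cosine_similarity(x, y):
    # K_cos = (x · y) / (|x| * |y|), per the formula in the embodiment.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical directions -> 1.0
```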
Fig. 4 is a schematic diagram of a medical text classification apparatus based on prompt learning according to an embodiment of the present invention; the apparatus includes:
the prompt information acquisition module 11 is configured to acquire prompt information for primary classification from the original medical text based on event priori information and knowledge priori information, where the primary classification includes department categories;
the information integration module 12 is used for filtering the original medical text, integrating the filtered text with the prompt information and inputting the integrated text into the large language generation model;
and the secondary classification module 13 is used for calculating the similarity between the result sequence output by the large language generation model and each secondary classification label representing the disease category under the primary classification, and taking the label category corresponding to the maximum value of the similarity as the secondary category output by the large language generation model.
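The secondary classification step — compare the model's output embedding against every secondary label embedding and keep the argmax — can be sketched as follows. The label embeddings here are toy vectors, not real Word2Vec output:

```python
import math

def cos(x, y):
    # Cosine similarity, as defined in the embodiment.
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def classify_secondary(result_vec, label_embeddings):
    # Pick the disease label whose embedding is most similar to the
    # result sequence produced by the large language generation model.
    return max(label_embeddings, key=lambda lbl: cos(result_vec, label_embeddings[lbl]))

labels = {"coronary heart disease": [0.9, 0.1], "stroke": [0.1, 0.9]}
print(classify_secondary([0.8, 0.2], labels))  # -> coronary heart disease
```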
The device of this embodiment may be used to implement the technical solution of the method embodiment shown in fig. 1, and its implementation principle and technical effects are similar, and are not described here again.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.
Claims (10)
1. The medical text classification method based on prompt learning is characterized by comprising the following steps of:
acquiring prompt information for primary classification from an original medical text based on event priori information and knowledge priori information, wherein the primary classification comprises department categories;
filtering the original medical text, integrating the filtered text with the prompt information, and inputting the integrated text into a large language generation model;
and calculating the similarity between the result sequence output by the large language generation model and each secondary classification label representing the disease category under the primary classification, and taking the label category corresponding to the maximum value of the similarity as the secondary category output by the large language generation model.
2. The prompt learning-based medical text classification method of claim 1, wherein the method of obtaining the prompt information comprises:
inputting an original medical text into a medical event feature extraction model to obtain structured event prompt information A;
inputting the original medical text into a medical knowledge feature extraction model to obtain structured knowledge prompt information B;
and integrating A, B and outputting to a prompt template generation module to obtain prompt information for primary classification.
3. The prompt learning based medical text classification method of claim 2 wherein the medical event feature extraction model is a pre-trained RoBERTa model, fine-tuning training using a dataset composed of published medical information.
4. The prompt learning based medical text classification method of claim 2, wherein the medical knowledge feature extraction model is a pre-trained BioBERT model, and fine tuning training is performed using a dataset composed of published medical knowledge-graph data.
5. The prompt learning based medical text classification method of claim 2, wherein said prompt template generation module is comprised of 4 identical layers, each layer employing a Transformer decoder with the second attention sublayer removed.
6. The prompt learning based medical text classification method as claimed in claim 1, wherein the original medical text is filtered by inputting it into a text filter; the text filter is a classifier adopting a ConvNeXt Tiny network, and is trained by sentence-level manual annotation of medical texts in which each sentence is labeled as a valid sentence or an invalid sentence.
7. The prompt learning-based medical text classification method according to claim 1, wherein before the prompt information is integrated with the filtered text, a space mapping module maps the feature space of the prompt information to the input feature space understandable by the large language generation model.
8. The prompt learning based medical text classification method of claim 1, wherein the primary classification comprises at least cardiovascular, orthopedics and neurology; the secondary classification under cardiovascular includes at least coronary heart disease, ischemic heart disease, stroke, aortic aneurysm and renal artery stenosis; the secondary classification under orthopedics at least comprises patellar fracture, ulnar nerve injury, congenital hip varus and infectious costal chondritis; the secondary classification under neurology includes at least cerebral hemorrhage, cerebral embolism, headache, and myelitis.
9. The prompt learning-based medical text classification method as claimed in claim 1, wherein the similarity calculation method comprises:
embedding the labels to be classified into representations by using a pre-trained word vector model Word2Vec, recorded as (x_1, x_2, …, x_n);
the result sequence output by the large language generation model is recorded as (y_1, y_2, …, y_n);
the cosine similarity of (x_1, x_2, …, x_n) and (y_1, y_2, …, y_n) is calculated as:
K_cos = (Σ_{i=1}^{n} x_i·y_i) / (√(Σ_{i=1}^{n} x_i²) · √(Σ_{i=1}^{n} y_i²))
wherein K_cos is the cosine similarity.
10. A prompt learning-based medical text classification apparatus, comprising:
the prompt information acquisition module is used for acquiring prompt information for primary classification from the original medical text based on event priori information and knowledge priori information, wherein the primary classification comprises department categories;
the information integration module is used for filtering the original medical text, integrating the filtered text with the prompt information and inputting the integrated text into the large language generation model;
and the secondary classification module is used for calculating the similarity between the result sequence output by the large language generation model and each secondary classification label representing the disease category under the primary classification, and taking the label category corresponding to the maximum value of the similarity as the secondary category output by the large language generation model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310817238.2A CN117033627A (en) | 2023-07-05 | 2023-07-05 | Medical text classification method and device based on prompt learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117033627A true CN117033627A (en) | 2023-11-10 |
Family
ID=88627026
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310817238.2A Pending CN117033627A (en) | 2023-07-05 | 2023-07-05 | Medical text classification method and device based on prompt learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117033627A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117972097A (en) * | 2024-03-29 | 2024-05-03 | 长城汽车股份有限公司 | Text classification method, classification device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||