CN117033627A - Medical text classification method and device based on prompt learning - Google Patents


Info

Publication number
CN117033627A
CN117033627A
Authority
CN
China
Prior art keywords
text
classification
prompt
medical
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310817238.2A
Other languages
Chinese (zh)
Inventor
俞洪
宋姗姗
吴子丰
俞益洲
李一鸣
乔昕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Shenrui Bolian Technology Co Ltd
Shenzhen Deepwise Bolian Technology Co Ltd
Original Assignee
Beijing Shenrui Bolian Technology Co Ltd
Shenzhen Deepwise Bolian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Shenrui Bolian Technology Co Ltd and Shenzhen Deepwise Bolian Technology Co Ltd
Priority to CN202310817238.2A
Publication of CN117033627A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2431 Multiple classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24317 Piecewise classification, i.e. whereby each classification requires several discriminant rules
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H70/00 ICT specially adapted for the handling or processing of medical references

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Molecular Biology (AREA)
  • Epidemiology (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a medical text classification method and device based on prompt learning. The method comprises the following steps: acquiring prompt information for primary classification from an original medical text based on event prior information and knowledge prior information, wherein the primary classification comprises department categories; filtering the original medical text, integrating the filtered text with the prompt information, and inputting the integrated result into a large language generation model; and calculating the similarity between the result sequence output by the large language generation model and each secondary classification label representing a disease category under the primary classification, and taking the label category with the maximum similarity as the secondary category output by the model. The invention realizes not only primary classification by department category but also secondary classification by disease category under the primary classification, which better matches established conventions in the medical field and makes the classification results more standardized; meanwhile, the classification labels need not be fixed, so multi-level open-domain text classification can be realized effectively.

Description

Medical text classification method and device based on prompt learning
Technical Field
The invention belongs to the technical field of text processing, and particularly relates to a medical text classification method and device based on prompt learning.
Background
Classifying medical texts, which carry specialized information and domain knowledge, has always been challenging: narrative devices such as technical terms, clinical experience, and contrastive turns can create ambiguity and even conflicting category signals between clauses, making whole-document classification difficult for a model. Research institutions at home and abroad have long proposed various deep learning models for text classification, but small models cannot deliver sufficiently accurate results owing to limits on their analytical capacity and hardware speed, while large models are difficult to deploy. In recent years, with the rapid development of large language models, new natural language processing paradigms such as knowledge transfer, fine-tuning and prompt learning have emerged, allowing small systems to provide convenient and fast text classification products and services by leveraging the massive parameters of large language models.
Existing medical text classification methods based on templated prompt learning generally concatenate the initial input text with prompt content, feed the result into a template-generating encoder, and update the prompt content inside the encoder along with the initial input text. The updated template parameters are then extracted, concatenated with the (un-updated) initial input text, and fed into a large pre-trained language model, whose massive parameters and training examples improve classification accuracy and related metrics. However, because the component that updates the template parameters is an encoder based on multi-head self-attention trained on a massive general-purpose corpus, it works well for generic text but lacks knowledge specific to the medical field; its domain understanding has many gaps and biases, so its grasp of medical context and semantics is limited. In addition, such methods ignore important fact-related medical information (such as the medical entities present, key index values, and relations between entities), and this missing information hinders classification of text in a specialized context. Models such as ChatGPT can generate classification labels, realizing a form of open-domain text classification that is not restricted to a given label set. However, lacking medical knowledge enhancement and standardization, their output categories are generally uncontrollable and usually amount only to a first-level label, so finer judgments about a specific disease are not accurate enough.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a medical text classification method and device based on prompt learning.
In order to achieve the above object, the present invention adopts the following technical scheme.
In a first aspect, the present invention provides a medical text classification method based on prompt learning, comprising the steps of:
acquiring prompt information for primary classification from an original medical text based on event priori information and knowledge priori information, wherein the primary classification comprises department categories;
filtering the original medical text, integrating the filtered text with the prompt information, and inputting the integrated text into a large language generation model;
and calculating the similarity between the result sequence output by the large language generation model and each secondary classification label representing the disease category under the primary classification, and taking the label category corresponding to the maximum value of the similarity as the secondary category output by the large language generation model.
Further, the method for acquiring the prompt message comprises the following steps:
inputting an original medical text into a medical event feature extraction model to obtain structured event prompt information A;
inputting the original medical text into a medical knowledge feature extraction model to obtain structured knowledge prompt information B;
and integrating A, B and outputting to a prompt template generation module to obtain prompt information for primary classification.
Still further, the medical event feature extraction model is a pre-trained RoBERTa model, with fine-tuning training using a dataset composed of published medical information.
Further, the medical knowledge feature extraction model is a pre-trained BioBERT model, and fine tuning training is performed by using a data set consisting of the disclosed medical knowledge graph data.
Further, the prompt template generation module is formed of 4 identical layers, each layer being a Transformer decoder with the second attention layer structure removed.
Further, the original medical text is filtered by a text filter; the text filter is a classifier built on a ConvNeXt Tiny network, trained on medical texts with sentence-level manual annotation in which each sentence is labeled as a valid sentence or an invalid sentence.
Further, before the prompt information is integrated with the filtered text, a space mapping module maps the feature space of the prompt information into the input feature space understood by the large language generation model.
Further, the primary classification includes at least cardiology, orthopedics and neurology; secondary classifications under cardiology include at least coronary heart disease, ischemic heart disease, stroke, aortic aneurysm and renal artery stenosis; secondary classifications under orthopedics include at least patellar fracture, ulnar nerve injury, congenital coxa vara and infectious costochondritis; secondary classifications under neurology include at least cerebral hemorrhage, cerebral embolism, headache and myelitis.
Further, the method for calculating the similarity comprises the following steps:
embedding the labels to be classified using the pre-trained word vector model Word2Vec, the representation being denoted (x_1, x_2, …, x_n);
denoting the result sequence output by the large language generation model as (y_1, y_2, …, y_n);
calculating the cosine similarity of (x_1, x_2, …, x_n) and (y_1, y_2, …, y_n) as follows:
K_cos = (Σ_{i=1}^{n} x_i·y_i) / (√(Σ_{i=1}^{n} x_i²) · √(Σ_{i=1}^{n} y_i²))
wherein K_cos is the cosine similarity.
In a second aspect, the present invention provides a medical text classification apparatus based on prompt learning, comprising:
the prompt information acquisition module is used for acquiring prompt information for primary classification from the original medical text based on event priori information and knowledge priori information, wherein the primary classification comprises department categories;
the information integration module is used for filtering the original medical text, integrating the filtered text with the prompt information and inputting the integrated text into the large language generation model;
and the secondary classification module is used for calculating the similarity between the result sequence output by the large language generation model and each secondary classification label representing the disease category under the primary classification, and taking the label category corresponding to the maximum value of the similarity as the secondary category output by the large language generation model.
Compared with the prior art, the invention has the following beneficial effects.
According to the invention, prompt information for primary classification (including department categories) is obtained from the original medical text based on event prior information and knowledge prior information; the original medical text is filtered, and the filtered text and the prompt information are integrated and input into a large language generation model; the similarity between the result sequence output by the model and each secondary classification label representing a disease category under the primary classification is calculated, and the label category with the maximum similarity is taken as the secondary category output by the model, thereby realizing medical text classification based on prompt learning. The invention realizes not only primary classification by department category but also secondary classification by disease category under the primary classification, which better matches established conventions in the medical field and makes the classification results more standardized; meanwhile, the classification labels need not be fixed, so multi-level open-domain text classification can be realized effectively.
Drawings
Fig. 1 is a flowchart of a medical text classification method based on prompt learning according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of two stages of another embodiment of the present invention.
Fig. 3 is a schematic structural diagram of the template generation module.
Fig. 4 is a block diagram of a medical text classification apparatus based on prompt learning according to an embodiment of the present invention.
Detailed Description
The present invention will be further described with reference to the drawings and the detailed description below, in order to make the objects, technical solutions and advantages of the present invention more apparent. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 1 is a flowchart of a medical text classification method based on prompt learning according to an embodiment of the invention, including the following steps:
step 101, acquiring prompt information for primary classification from an original medical text based on event priori information and knowledge priori information, wherein the primary classification comprises department categories;
step 102, filtering an original medical text, integrating the filtered text with the prompt information, and inputting the integrated text into a large language generation model;
and 103, calculating the similarity between a result sequence output by the large language generation model and each secondary classification label representing the disease category under the primary classification, and taking the label category corresponding to the maximum value of the similarity as the secondary category output by the large language generation model.
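The three steps above can be sketched as one pipeline. This is a minimal illustration only: every callable passed in is a hypothetical stand-in for the corresponding trained component described in the embodiments, not the patented implementation.

```python
def classify_medical_text(text, secondary_labels,
                          get_prompt, filter_text, llm_generate,
                          embed, cosine):
    """Sketch of steps 101-103; all callables are hypothetical stubs."""
    # Step 101: prompt information for primary classification
    prompt = get_prompt(text)
    # Step 102: filter the text, integrate with the prompt, run the LLM
    integrated = prompt + filter_text(text)
    result_seq = llm_generate(integrated)
    # Step 103: pick the secondary label most similar to the LLM output
    scores = {lab: cosine(embed(lab), embed(result_seq))
              for lab in secondary_labels}
    return max(scores, key=scores.get)
```

In practice `get_prompt` would be the event/knowledge extraction plus template generation pipeline, `filter_text` the ConvNeXt Tiny sentence filter, and `embed`/`cosine` the Word2Vec similarity step described below.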
In this embodiment, step 101 is mainly used to obtain the prompt information for primary classification. Prompt learning uses a template to elicit factual knowledge from a pre-trained language model and then maps the elicited knowledge, as an answer space, onto a target space to complete the required downstream task. Prompt learning can effectively support fine-tuning a downstream task on a large language model, achieving efficient parameter transfer learning with very little fine-tuning compute. Existing prompt-learning text classification techniques have two problems. One uses hard templates (discrete prompt templates, i.e., actual text strings): hard templates are subjectively constructed and limited in what they can express, so the model cannot fully use the information in the text, lowering prediction accuracy. The other introduces soft prompt templates (continuous prompt templates, described directly in the embedding space of the underlying language model, usually as vector representations): these lack intuitive interpretation and prior knowledge, rely solely on downstream-task data for tuning, struggle to reach good performance, and are relatively expensive to train. For this reason, the present embodiment builds a soft prompt template from knowledge extraction information and event extraction results. Such a template can effectively capture the text's main medical events and the related medical knowledge, better optimizing the classification performance of the large language generation model. Meanwhile, the text can be classified at the primary level according to the soft prompt information, preliminarily yielding its primary category label.
The primary classification of this embodiment mainly comprises department categories, such as cardiology and orthopedics.
In this embodiment, step 102 is mainly used to integrate the original medical text with the prompt information. Original medical texts are generally unstructured free-text descriptions and usually contain a large amount of invalid information and fields, such as unit information and addresses unrelated to the content, text watermarks, and data interfaces or special fields in an HTML file. The density of useful information in such a text is therefore low, and directly taking all text features would introduce a large amount of uncertain noise and dilute the core content. For this reason, the original medical text is filtered first, and the filtered text, with invalid sentences removed, is integrated with the prompt information. Integration means concatenating the two feature vectors together.
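The "splicing" described above is plain concatenation of the two feature sequences. A toy illustration (the vectors here are placeholders, not real model features):

```python
def integrate(prompt_features, filtered_text_features):
    """Concatenate prompt features with filtered-text features into the
    single sequence that is fed to the large language generation model."""
    return list(prompt_features) + list(filtered_text_features)
```

For example, `integrate([0.2, 0.7], [0.1, 0.9, 0.4])` yields one 5-element sequence; in the real system both inputs would be high-dimensional embedding sequences.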
For ease of understanding, a piece of original medical text is extracted below:
Compared with the 2016-11-29 CT after chemotherapy for right lung cancer: the lesions in the right upper lung are slightly smaller than before, and the surrounding small inflammatory foci are slightly reduced. Scattered nodules in both lungs, approximately the same as before. The multiple lymph nodes in the subclavian area and left mediastinum are slightly reduced; metastasis is considered. Hepatic cyst. Small cyst of the left kidney. Compared with the 2016-11-29 CT after chemotherapy for right lung cancer: the right upper lung shows an irregular nodular and patchy lesion with unclear boundary, maximum cross-section about 12mm×8mm, lobulated margin, and uneven enhancement on contrast scan, closely abutting the oblique pleura; its extent is slightly smaller than before. The right upper lung also shows a few patchy, slightly high-density shadows with unclear boundaries, markedly reduced compared with before. The right kidney and both adrenal glands show no abnormality. No enlarged lymph nodes behind the diaphragmatic crura or beside the abdominal aorta. No signs of bone destruction within the scan range.
In this embodiment, step 103 is mainly used to output the secondary category. The integrated information obtained in step 102 is input into the large language generation model, the similarity between the result sequence output by the model and each secondary classification label is calculated, and the label category with the maximum similarity is taken as the output secondary category. Large language generation models are language models with a very large number of parameters (hundreds of billions or more) trained on massive text data, such as GPT-3. GPT-3 is built by stacking multiple copies of the Transformer decoder layer (with its second attention structure removed). GPT-3 is extremely capable at text generation and achieves striking results on very complex NLP tasks, a major advance for solutions in fields such as creative fiction, stories, resumes, narration, chatbots, and text summarization. These tasks require no supervised fine-tuning of the model; for a new task, GPT-3 needs very little data to understand the requirements and approach or surpass the current state of the art. The secondary classification of this embodiment is a disease category under the primary (department) classification, such as coronary heart disease or ischemic heart disease. To realize open-domain classification, the classification labels need not be fixed: when labels are acquired, they are embedded via learned representations, each secondary classification label is represented as a vector, and representation and similarity computation are performed on the combined label formed from the corresponding primary label and the secondary label.
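The combined-label scoring just described can be sketched as follows. All callables are hypothetical stand-ins (a real system would use the Word2Vec embeddings and cosine similarity described later), and the label tree may change between calls, which is what makes the scheme open-domain.

```python
def rank_open_domain_labels(result_seq, label_tree, embed, similarity):
    """Score combined primary/secondary labels against the LLM output.

    label_tree maps a primary (department) label to its secondary
    (disease) labels; embed and similarity are stand-in callables.
    Returns the best (primary, secondary) pair."""
    out = embed(result_seq)
    scored = []
    for dept, diseases in label_tree.items():
        for disease in diseases:
            combined = f"{dept} {disease}"   # combined label representation
            scored.append((similarity(embed(combined), out), dept, disease))
    scored.sort(reverse=True)
    return scored[0][1], scored[0][2]
```

Because labels are embedded on the fly rather than baked into a classifier head, adding a new department or disease requires no retraining, only a new label string.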
Therefore, the embodiment not only can realize the first-level classification, but also can realize the second-level open domain classification under the first-level classification.
As an optional embodiment, the method for obtaining the prompt information includes:
inputting an original medical text into a medical event feature extraction model to obtain structured event prompt information A;
inputting the original medical text into a medical knowledge feature extraction model to obtain structured knowledge prompt information B;
and integrating A, B and outputting to a prompt template generation module to obtain prompt information for primary classification.
This embodiment provides a technical scheme for acquiring the prompt information. As shown in fig. 2, the original medical text is first input into the medical event feature extraction model and the medical knowledge feature extraction model respectively, extracting event prompt information and knowledge prompt information. For example, after the sentence "boundary is unclear, the maximum cross-section is about 12mm×8mm" is input into the medical event feature extraction model, it outputs "primary lesion size: 12mm×8mm"; and after the triplet <lung malignant tumor, disease-corresponding medicine, gefitinib> is input into the medical knowledge feature extraction model, it outputs "oncology". The extracted event prompt information and knowledge prompt information are then integrated (concatenated) and sent to the template generation module to obtain the prompt information for primary classification.
As an alternative embodiment, the medical event feature extraction model is a pre-trained RoBERTa model, and the fine-tuning training is performed by using a data set composed of the disclosed medical information.
This embodiment provides the network structure and training method of the medical event feature extraction model, which is a pre-trained RoBERTa model. BERT is a classic deep learning network; BERT-based models perform well in natural language processing, with particularly notable results in text representation and feature encoding. RoBERTa is an enhanced, more finely tuned version of BERT: it uses more parameters and more training data, and introduces dynamic masking and a larger vocabulary during training. Extensive literature and experiments show that the RoBERTa model performs excellently on information extraction tasks, so RoBERTa is chosen as the medical event feature extraction model in this embodiment. A training dataset is constructed from publicly available medical information and used to fine-tune the pre-trained RoBERTa model; after training, the model can extract structured event information from unstructured text.
As an alternative embodiment, the medical knowledge feature extraction model is a pretrained BioBERT model, and the fine tuning training is performed by using a data set composed of the disclosed medical knowledge-graph data.
This embodiment provides the network structure and training method of the medical knowledge feature extraction model, which is a pre-trained BioBERT model. BioBERT is likewise a more finely tuned version of BERT: it is a pre-trained biomedical language representation model for biomedical text mining, injected with reinforced medical knowledge so that it encodes a large amount of medical text information, and it excels at biomedical text mining tasks. BioBERT is therefore chosen as the medical knowledge feature extraction model in this embodiment. The model is fine-tuned on the BioBERT pre-trained weights using publicly available large-scale medical knowledge graph data, with a multi-label department classification task; after training, it can extract structured knowledge information from unstructured text.
As an alternative embodiment, the prompt template generation module is formed of 4 identical layers, each layer being a Transformer decoder with the second attention layer structure removed.
This embodiment provides the network structure of the template generation module, which is formed by stacking 4 identical layers; each layer is a Transformer decoder with the second attention layer structure removed. The specific structure is shown in fig. 3.
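A decoder layer of this kind (self-attention plus feed-forward, with the encoder-decoder cross-attention sub-layer dropped) can be sketched in NumPy. This is a structural sketch only, under simplifying assumptions: single head, no layer normalization, random toy weights.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decoder_layer_no_cross_attn(x, Wq, Wk, Wv, W1, W2):
    """One Transformer decoder layer with its second (encoder-decoder)
    attention sub-layer removed: masked self-attention + feed-forward.
    x has shape (sequence length n, model dimension d)."""
    n, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = (q @ k.T) / np.sqrt(d)
    causal = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(causal, -1e9, scores)   # mask future positions
    x = x + softmax(scores, axis=-1) @ v      # self-attention + residual
    x = x + np.maximum(x @ W1, 0.0) @ W2      # ReLU feed-forward + residual
    return x
```

The module described in the text would stack 4 such layers, i.e. apply `decoder_layer_no_cross_attn` four times with each layer's own weights.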
As an alternative embodiment, the original medical text is filtered by entering a text filter; the text filter is a classifier adopting a ConvNeXt Tiny network, and training of the text filter is realized by marking each sentence as a valid sentence or an invalid sentence through manual data marking at sentence level on the medical text.
This embodiment provides a technical scheme for filtering the original medical text, using a text filter. The text filter is a binary classifier whose two output classes are valid sentence and invalid sentence. Medical texts are manually annotated at the sentence level, each sentence labeled as valid or invalid, and a ConvNeXt Tiny network is trained on this classification task; the trained model is the text filter. A passage is judged sentence by sentence as it passes through the filter: valid sentences are kept and invalid sentences are removed. For example, the following piece of original medical text is entered into the text filter:
Anterior basal segment nodule of the right lower lung: peripheral lung cancer is highly likely. Nodule near the anterior-internal basal segment of the left lower lung: metastasis? Follow-up after treatment is recommended. Mediastinal lymph node enlargement. Bronchitis. Multiple cysts of both kidneys. Auditing physician: Liu Haipeng; reporting physician: Liu Haipeng; issuing physician: Chou. Reporting date: 14.6.2019.11:22:03. This report is for clinical reference only; an unpublished physician (electronic) signature is considered invalid. XIMC-06-ZY-YL-0142018-A1.
The text filter outputs the following text:
Anterior basal segment nodule of the right lower lung: peripheral lung cancer is highly likely. Nodule near the anterior-internal basal segment of the left lower lung: metastasis? Follow-up after treatment is recommended. Mediastinal lymph node enlargement. Bronchitis. Multiple cysts of both kidneys. Reporting date: 14.6.2019.11:22:03.
It can be seen that the filter deleted the invalid sentences "Auditing physician: Liu Haipeng; reporting physician: Liu Haipeng; issuing physician: Chou" and "This report is for clinical reference only; an unpublished physician (electronic) signature is considered invalid. XIMC-06-ZY-YL-0142018-A1".
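The sentence-level filtering step can be sketched as follows; `is_valid` stands in for the trained ConvNeXt Tiny sentence classifier (here any callable returning True/False will do, so the predicate in the example is a hypothetical heuristic, not the trained model).

```python
def filter_text(text, is_valid, sep="."):
    """Split a report into sentences and keep only those the sentence
    classifier accepts, re-joining the valid ones in order."""
    sentences = [s for s in text.split(sep) if s.strip()]
    kept = [s for s in sentences if is_valid(s)]
    return sep.join(kept) + sep if kept else ""
```

In the real system the valid/invalid decision comes from the trained classifier rather than any keyword rule.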
As an alternative embodiment, before being integrated with the filtered text, the prompt information is mapped, through a space mapping module, from the feature space in which it lies to a feature space that the large language generation model can take as input and understand.
In this embodiment, a space mapping module is additionally provided in front of the large language generation model to map the feature space in which the prompt information lies to a feature space that the large language generation model can take as input and understand. The space mapping module consists of a Transformer encoder layer and a fully connected layer.
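A minimal PyTorch sketch of such a space mapping module, assuming a single Transformer encoder layer followed by a fully connected projection; the dimensions and hyperparameters are illustrative assumptions, not values taken from this embodiment.

```python
import torch
import torch.nn as nn

class SpaceMappingModule(nn.Module):
    """One Transformer encoder layer plus a fully connected layer that
    projects prompt features into the language model's input space.
    All dimensions below are illustrative assumptions."""

    def __init__(self, prompt_dim=512, model_dim=1024, nhead=8):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(
            d_model=prompt_dim, nhead=nhead, batch_first=True)
        self.fc = nn.Linear(prompt_dim, model_dim)

    def forward(self, prompt_feats):
        # prompt_feats: (batch, seq_len, prompt_dim)
        return self.fc(self.encoder(prompt_feats))
```

The fully connected layer performs the actual cross-space projection; the encoder layer lets prompt positions attend to each other before projection.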
As an alternative embodiment, the primary classification includes at least cardiovascular medicine, orthopedics and neurology; the secondary classification under cardiovascular medicine includes at least coronary heart disease, ischemic heart disease, stroke, aortic aneurysm and renal artery stenosis; the secondary classification under orthopedics includes at least patellar fracture, ulnar nerve injury, congenital hip varus and infectious costal chondritis; the secondary classification under neurology includes at least cerebral hemorrhage, cerebral embolism, headache and myelitis.
This embodiment gives several specific primary and secondary classification categories. It should be noted that this is only a preferred embodiment and does not exclude or negate other possible embodiments, such as different departments and different disease categories.
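The two-level label hierarchy enumerated above can be represented as a simple mapping from primary (department) categories to secondary (disease) labels; the English key names are an assumption of this sketch, and a real deployment would extend the mapping.

```python
# Primary (department) to secondary (disease) labels, as enumerated
# in this embodiment. Key names are illustrative English renderings.
TAXONOMY = {
    "cardiovascular": ["coronary heart disease", "ischemic heart disease",
                       "stroke", "aortic aneurysm", "renal artery stenosis"],
    "orthopedics": ["patellar fracture", "ulnar nerve injury",
                    "congenital hip varus", "infectious costal chondritis"],
    "neurology": ["cerebral hemorrhage", "cerebral embolism",
                  "headache", "myelitis"],
}

def secondary_labels(primary):
    # Look up the candidate secondary labels for a primary category.
    return TAXONOMY[primary]
```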
As an optional embodiment, the method for calculating the similarity includes:
The labels to be classified are embedded using a pre-trained Word2Vec word vector model, and the embedded representation is recorded as (x_1, x_2, …, x_n);

The result sequence output by the large language generation model is recorded as (y_1, y_2, …, y_n);

The cosine similarity of (x_1, x_2, …, x_n) and (y_1, y_2, …, y_n) is calculated as follows:

K_cos = (Σ_{i=1}^{n} x_i·y_i) / (√(Σ_{i=1}^{n} x_i²) · √(Σ_{i=1}^{n} y_i²))

where K_cos is the cosine similarity.
This embodiment provides a technical scheme for calculating the similarity between the result sequence output by the large language generation model and the secondary classification labels. Word2Vec is used to obtain an embedded representation of each label to be classified, and the cosine similarity between the result sequence and that embedded representation is then computed according to the formula above.
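A minimal sketch of the similarity computation and label selection, using plain Python; the embedding vectors in the usage example are illustrative stand-ins for the Word2Vec label embeddings and the model's result-sequence representation.

```python
import math

def cosine_similarity(x, y):
    # K_cos = (sum x_i * y_i) / (||x|| * ||y||) for equal-length vectors.
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

def pick_secondary(result_embedding, label_embeddings):
    # Return the label whose embedding maximizes cosine similarity with
    # the result sequence, i.e. the secondary category output.
    return max(label_embeddings,
               key=lambda lbl: cosine_similarity(result_embedding,
                                                 label_embeddings[lbl]))
```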
Fig. 4 is a schematic diagram of a medical text classification apparatus based on prompt learning according to an embodiment of the present invention. The apparatus includes:
the prompt information acquisition module 11, configured to acquire, from the original medical text, prompt information for primary classification based on event prior information and knowledge prior information, where the primary classification includes department categories;

the information integration module 12, configured to filter the original medical text, integrate the filtered text with the prompt information, and input the integrated text into the large language generation model;

and the secondary classification module 13, configured to calculate the similarity between the result sequence output by the large language generation model and each secondary classification label representing a disease category under the primary classification, and to take the label category with the maximum similarity as the secondary category output by the large language generation model.
The apparatus of this embodiment may be used to implement the technical solution of the method embodiment shown in Fig. 1; its implementation principle and technical effects are similar and are not repeated here.
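A sketch of how the three modules compose into one classification pipeline; every concrete model (prompt extractor, text filter, language model, similarity scorer) is injected as a callable and is an assumption of this illustration, not an implementation from the embodiment.

```python
class MedicalTextClassifier:
    """Composition of the three apparatus modules; the injected
    callables stand in for the trained models."""

    def __init__(self, get_prompt, filter_text, generate, score):
        self.get_prompt = get_prompt    # module 11: prompt acquisition
        self.filter_text = filter_text  # module 12: text filtering
        self.generate = generate        # large language generation model
        self.score = score              # module 13: similarity scoring

    def classify(self, raw_text, secondary_labels):
        prompt = self.get_prompt(raw_text)
        filtered = self.filter_text(raw_text)
        result_seq = self.generate(prompt + "\n" + filtered)
        # Pick the secondary label with maximum similarity to the result.
        return max(secondary_labels,
                   key=lambda lbl: self.score(result_seq, lbl))
```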
The foregoing is merely a description of embodiments of the present invention and is not intended to limit it; any changes or substitutions readily conceived by those skilled in the art within the scope of the present invention shall be covered by it. Therefore, the protection scope of the present invention is subject to the protection scope of the claims.

Claims (10)

1. A medical text classification method based on prompt learning, characterized by comprising the following steps:
acquiring, from an original medical text, prompt information for primary classification based on event prior information and knowledge prior information, wherein the primary classification includes department categories;
filtering the original medical text, integrating the filtered text with the prompt information, and inputting the integrated text into a large language generation model;
and calculating the similarity between the result sequence output by the large language generation model and each secondary classification label representing the disease category under the primary classification, and taking the label category corresponding to the maximum value of the similarity as the secondary category output by the large language generation model.
2. The prompt learning-based medical text classification method of claim 1, wherein the method of obtaining the prompt information comprises:
inputting an original medical text into a medical event feature extraction model to obtain structured event prompt information A;
inputting the original medical text into a medical knowledge feature extraction model to obtain structured knowledge prompt information B;
and integrating A and B and inputting the result into a prompt template generation module to obtain the prompt information for primary classification.
3. The prompt learning-based medical text classification method of claim 2, wherein the medical event feature extraction model is a pre-trained RoBERTa model, fine-tuned using a dataset composed of published medical information.
4. The prompt learning-based medical text classification method of claim 2, wherein the medical knowledge feature extraction model is a pre-trained BioBERT model, fine-tuned using a dataset composed of published medical knowledge-graph data.
5. The prompt learning-based medical text classification method of claim 2, wherein the prompt template generation module consists of 4 identical network layers, each layer employing a Transformer decoder with the second attention sub-layer removed.
6. The prompt learning-based medical text classification method of claim 1, wherein the original medical text is filtered by being input into a text filter; the text filter is a classifier employing a ConvNeXt Tiny network, and is trained by sentence-level manual annotation of medical texts in which each sentence is labelled as a valid sentence or an invalid sentence.
7. The prompt learning-based medical text classification method of claim 1, wherein, before being integrated with the filtered text, the prompt information is mapped, through a space mapping module, from the feature space in which it lies to a feature space that the large language generation model can take as input and understand.
8. The prompt learning-based medical text classification method of claim 1, wherein the primary classification includes at least cardiovascular medicine, orthopedics and neurology; the secondary classification under cardiovascular medicine includes at least coronary heart disease, ischemic heart disease, stroke, aortic aneurysm and renal artery stenosis; the secondary classification under orthopedics includes at least patellar fracture, ulnar nerve injury, congenital hip varus and infectious costal chondritis; the secondary classification under neurology includes at least cerebral hemorrhage, cerebral embolism, headache and myelitis.
9. The prompt learning-based medical text classification method as claimed in claim 1, wherein the similarity calculation method comprises:
embedding the labels to be classified using a pre-trained Word2Vec word vector model, and recording the embedded representation as (x_1, x_2, …, x_n);

recording the result sequence output by the large language generation model as (y_1, y_2, …, y_n);

calculating the cosine similarity of (x_1, x_2, …, x_n) and (y_1, y_2, …, y_n) as follows:

K_cos = (Σ_{i=1}^{n} x_i·y_i) / (√(Σ_{i=1}^{n} x_i²) · √(Σ_{i=1}^{n} y_i²))

wherein K_cos is the cosine similarity.
10. A prompt learning-based medical text classification apparatus, comprising:
the prompt information acquisition module, configured to acquire, from the original medical text, prompt information for primary classification based on event prior information and knowledge prior information, wherein the primary classification includes department categories;
the information integration module is used for filtering the original medical text, integrating the filtered text with the prompt information and inputting the integrated text into the large language generation model;
and the secondary classification module is used for calculating the similarity between the result sequence output by the large language generation model and each secondary classification label representing the disease category under the primary classification, and taking the label category corresponding to the maximum value of the similarity as the secondary category output by the large language generation model.
CN202310817238.2A 2023-07-05 2023-07-05 Medical text classification method and device based on prompt learning Pending CN117033627A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310817238.2A CN117033627A (en) 2023-07-05 2023-07-05 Medical text classification method and device based on prompt learning


Publications (1)

Publication Number Publication Date
CN117033627A true CN117033627A (en) 2023-11-10

Family

ID=88627026

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310817238.2A Pending CN117033627A (en) 2023-07-05 2023-07-05 Medical text classification method and device based on prompt learning

Country Status (1)

Country Link
CN (1) CN117033627A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117972097A (en) * 2024-03-29 2024-05-03 长城汽车股份有限公司 Text classification method, classification device, electronic equipment and storage medium


Similar Documents

Publication Publication Date Title
He et al. Pathvqa: 30000+ questions for medical visual question answering
Zeng et al. Counterfactual generator: A weakly-supervised method for named entity recognition
Wieting et al. Charagram: Embedding words and sentences via character n-grams
CN108009182B (en) Information extraction method and device
CN108628824A (en) A kind of entity recognition method based on Chinese electronic health record
CN111680089B (en) Text structuring method, device and system and non-volatile storage medium
Grewal et al. Radiology gets chatty: the ChatGPT saga unfolds
CN106934220A (en) Towards the disease class entity recognition method and device of multi-data source
Deng et al. Speech-based diagnosis of autism spectrum condition by generative adversarial network representations
US11468989B2 (en) Machine-aided dialog system and medical condition inquiry apparatus and method
CN106897559A (en) A kind of symptom and sign class entity recognition method and device towards multi-data source
CN117033627A (en) Medical text classification method and device based on prompt learning
CN110134951A (en) A kind of method and system for analyzing the potential theme phrase of text data
CN112241457A (en) Event detection method for event of affair knowledge graph fused with extension features
Milling et al. Is speech the new blood? recent progress in ai-based disease detection from audio in a nutshell
Banerjee et al. Automatic inference of BI-RADS final assessment categories from narrative mammography report findings
CN117787282B (en) Doctor-patient text intelligent extraction method based on large language model
Dasgupta et al. Word2box: Capturing set-theoretic semantics of words using box embeddings
CN110069639B (en) Method for constructing thyroid ultrasound field ontology
Gajbhiye et al. Automatic report generation for chest x-ray images: a multilevel multi-attention approach
Kim et al. Distilling wikipedia mathematical knowledge into neural network models
Wang et al. Accurate classification of lung nodules on CT images using the TransUnet
CN110199354B (en) Biological system information retrieval system and method
CN116842168B (en) Cross-domain problem processing method and device, electronic equipment and storage medium
CN113609360A (en) Scene-based multi-source data fusion analysis method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination