CN112347773A - Medical application model training method and device based on BERT model - Google Patents

Medical application model training method and device based on BERT model

Info

Publication number
CN112347773A
CN112347773A (application CN202011159163.6A)
Authority
CN
China
Prior art keywords
training
model
evidence
bert model
pico
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011159163.6A
Other languages
Chinese (zh)
Inventor
刘静
周永杰
王则远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Nuodao Cognitive Medical Technology Co ltd
Original Assignee
Beijing Nuodao Cognitive Medical Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Nuodao Cognitive Medical Technology Co ltd filed Critical Beijing Nuodao Cognitive Medical Technology Co ltd
Priority to CN202011159163.6A priority Critical patent/CN112347773A/en
Publication of CN112347773A publication Critical patent/CN112347773A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention provides a medical application model training method and device based on a BERT model. The method comprises: acquiring an evidence-based medical training sample; performing entity-vocabulary masking on the evidence-based medical training sample to obtain an MLM training sample; and performing MLM training on the BERT model using the MLM training samples to obtain a PICO-BERT model, wherein the entity vocabulary corresponds to entities having practical significance in evidence-based medicine. Performing MLM training with samples in which whole entity words are masked enhances the model's overall semantic representation capability, so the trained PICO-BERT model has stronger semantic comprehension and a stronger ability to handle natural language problems in complex domain-specific scenarios. The method can therefore be better applied in the medical field and improves the comprehension of natural language in specific research scenarios in the medical field.

Description

Medical application model training method and device based on BERT model
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a medical application model training method and device based on a BERT model.
Background
In the field of natural language processing, pre-training language models have created a new research paradigm and refreshed the state of the art on multiple natural language processing tasks. In this paradigm, a language model is first pre-trained on a large amount of unsupervised corpus and then fine-tuned with a small amount of labeled corpus to complete downstream NLP tasks such as text classification, sequence labeling, machine translation, and reading comprehension.
The pre-training language model BERT introduces two pre-training tasks, MLM (Masked Language Model) and NSP (Next Sentence Prediction), performs pre-training on a larger-scale corpus, and achieved the best results on 11 natural language understanding tasks. To ensure the universality of the BERT model, the large-scale corpus on which BERT is based covers all knowledge fields. A pre-training language model trained on such a corpus can be used to solve natural language problems in different fields, but it generally delivers only average performance in some professional fields and cannot adaptively solve the natural language problems of those fields.
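As a minimal illustration of the MLM task described above (an explanatory sketch, not part of the patent text; the function name is illustrative, the 15% rate follows the original BERT paper, and BERT's 80/10/10 replacement detail is omitted for brevity):

```python
import random

def random_token_mask(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Randomly replace ~15% of tokens with [MASK], as in BERT's MLM task."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            targets[i] = tok      # the model is trained to recover this token
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets

# Original BERT treats each Chinese character as one token, so each
# masked unit here is a single character, not a whole word or entity.
tokens = list("急性淋巴细胞白血病患者")
print(random_token_mask(tokens))
```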
Although current pre-training language models perform well in the general domain, they cannot properly solve natural language processing problems in professional domains, because the large-scale corpora they are based on are not specific to any particular field. In the medical field this defect is especially serious: the medical domain is highly specialized and the fault tolerance allowed of deep learning models used in it is low, so commonly used pre-training language models such as BERT have poor applicability in the medical field and cannot solve the natural language problems of certain specific research scenarios in the medical field.
Therefore, how to provide a model scheme that can be better applied in the medical field and improve the comprehension of natural language in specific research scenarios in the medical field is a technical problem to be urgently solved by those skilled in the art.
Disclosure of Invention
The embodiments of the invention provide a method and a device for training a medical application model based on a BERT model, which can be better applied to the medical field and improve the comprehension of natural language in specific research scenarios in the medical field.
The embodiment of the invention provides a medical application model training method based on a BERT model, which comprises the following steps:
acquiring an evidence-based medical training sample;
performing entity-vocabulary masking on the evidence-based medical training sample to obtain an MLM training sample;
performing MLM training on the BERT model using the MLM training samples to obtain a PICO-BERT model;
wherein the entity vocabulary corresponds to entities having practical significance in evidence-based medicine.
Further, the acquiring of the evidence-based medical training sample comprises:
acquiring medical literature;
extracting the PICO entities from the medical literature, the PICO entities including: the problem object, the intervention measure, the comparison (alternative) measure, and the outcome;
determining the PICO entities in each piece of medical literature as an evidence-based medical training sample.
Further, the performing entity-vocabulary masking on the evidence-based medical training sample to obtain an MLM training sample includes:
performing word segmentation on the evidence-based medical training sample to obtain a word segmentation result;
aligning the word segmentation result with the PICO entities to obtain an alignment result;
masking the evidence-based medical training sample using the entity vocabulary in the alignment result to obtain an MLM training sample;
wherein a PICO entity is an entity with actual natural meaning in evidence-based medicine.
Further, the entity vocabulary includes at least one of: disease category vocabulary, drug vocabulary, biological enzyme vocabulary, and pathological reaction vocabulary.
Further, the performing MLM training on the BERT model using the MLM training samples to obtain the PICO-BERT model includes:
placing the BERT model in a training mode;
inputting the MLM training samples into the BERT model to obtain a trained BERT model;
and determining the trained BERT model as a PICO-BERT model.
Further, the NSP training task in the PICO-BERT model is deleted.
In a second aspect, an embodiment of the present invention provides a medical application model training apparatus based on a BERT model, including:
a sample acquisition module, configured to acquire an evidence-based medical training sample;
a sample processing module, configured to perform entity-vocabulary masking on the evidence-based medical training sample to obtain an MLM training sample;
a model training module, configured to perform MLM training on the BERT model using the MLM training samples to obtain a PICO-BERT model;
wherein the entity vocabulary corresponds to entities having practical significance in evidence-based medicine.
Further, the sample processing module comprises:
a word segmentation unit, configured to segment the evidence-based medical training sample into words to obtain a word segmentation result;
an alignment unit, configured to align the word segmentation result with the PICO entities to obtain an alignment result;
a masking unit, configured to mask the evidence-based medical training sample using the entity vocabulary in the alignment result to obtain an MLM training sample;
wherein a PICO entity is an entity with actual natural meaning in evidence-based medicine.
Embodiments of the present invention further provide an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of any of the BERT model-based medical application model training methods described above.
Embodiments of the present invention further provide a non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor, implements the steps of the BERT model-based medical application model training method as described in any of the above.
According to the medical application model training method and device based on the BERT model provided by the embodiments of the invention, MLM training is performed on the BERT model using evidence-based medical training samples, and during training, entity-vocabulary masking is performed on each evidence-based medical training sample to obtain an MLM training sample. Performing MLM training with these entity-masked samples yields the PICO-BERT model, which enhances the model's overall semantic representation capability, gives the trained PICO-BERT model stronger semantic comprehension and a stronger ability to handle natural language problems in complex domain-specific scenarios, and allows the method to be better applied in the medical field, improving the comprehension of natural language in specific research scenarios in the medical field.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a flowchart of a medical application model training method based on a BERT model according to an embodiment of the present invention;
FIG. 2 is a sample acquisition flowchart of a method for training a BERT-model-based medical application model according to an embodiment of the present invention;
FIG. 3 is a sample processing flow chart of a method for training a BERT model-based medical application model according to an embodiment of the present invention;
FIG. 4 is a model training flowchart of a medical application model training method based on a BERT model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating a BERT model-based medical application model training apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a sample processing module of a BERT model-based medical application model training apparatus according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A method for training a BERT model-based medical application model according to an embodiment of the present invention is described below with reference to fig. 1 to 4.
Fig. 1 is a flowchart of a medical application model training method based on a BERT model according to an embodiment of the present invention; FIG. 2 is a sample acquisition flowchart of a method for training a BERT-model-based medical application model according to an embodiment of the present invention; FIG. 3 is a sample processing flow chart of a method for training a BERT model-based medical application model according to an embodiment of the present invention; fig. 4 is a model training flowchart of a medical application model training method based on a BERT model according to an embodiment of the present invention.
In a specific embodiment, the present invention provides a medical application model training method based on a BERT model, comprising:
step S11: acquiring a evidence-based medical training sample;
in the embodiment of the invention, firstly, a evidence-based medicine training sample needs to be obtained, wherein the evidence-based medicine is to closely combine the three essential evidence-based medicines, namely the best clinical evidence, the skilled clinical experience and the specific situation of a patient, to find and collect the best clinical evidence, and aims to obtain a more sensitive and reliable diagnosis method, a more effective and safer treatment scheme and strive to enable the patient to obtain the best treatment result. Therefore, the research of syndrome-oriented medicine covers the whole process of diagnosis and treatment, and can fully embody the actual value of medical knowledge. PICO is a search method for formatting information based on evidence-based medicine (EBM) theory. Abbreviations for partitionants, intersections, compositions, utcoms. P is the subject of the problem (patient or population), I is the intervention (e.g. diagnostic treatment), C is the other candidate (compare), and O is the outcome (outome, the diagnosis and treatment effect of the intervention). In every medical literature, there is a PICO.
Step S12: performing entity-vocabulary masking on the evidence-based medical training sample to obtain an MLM training sample;
the embodiment of the invention also provides a syndrome-oriented medical training sample, which is specially processed, and particularly, a knowledge graph in the syndrome-oriented medical field can be built based on syndrome-oriented medical knowledge, and PICO entities contained in the syndrome-oriented medical knowledge graph are extracted by an entity extraction method, wherein the entities are medical entities with syndrome-oriented medical significance.
The existing BERT pre-training process comprises two different pre-training tasks, namely the Masked Language Model task and the Next Sentence Prediction task. The Masked Language Model (MLM) trains a bidirectional language model by randomly masking some tokens and then predicting the masked tokens, so that the representation of each token draws on context information. However, when the original BERT masks, it masks one sub-word for English and one character for Chinese; this single-character modeling strategy only learns the co-occurrence relations among the characters inside an entity and does not learn an overall semantic representation of the entity.
Step S13: performing MLM training on the BERT model using the MLM training samples to obtain a PICO-BERT model; wherein the entity vocabulary corresponds to entities having practical significance in evidence-based medicine.
In addition, in the Masked Language Model (MLM) task of the embodiment, random masking selects the whole vocabulary corresponding to an entity rather than a single masked character. This requires the corpus to be segmented into words before pre-training and the segmentation results to be aligned with the PICO entities extracted from the evidence-based medical knowledge graph. For example, given 'a correlation study between the gene polymorphism of methylenetetrahydrofolate reductase (I) and the adverse reactions to the chemotherapy drug methotrexate (P) in patients with acute lymphoblastic leukemia (P)', PICO-BERT needs to accurately predict 'methylenetetrahydrofolate reductase', 'leukemia', and 'methotrexate' from context information such as 'gene polymorphism', 'chemotherapy drug', and 'adverse reaction'. Entity knowledge is merged into the pre-training process of the PICO-BERT model through this entity masking. PICO-BERT can thereby learn the semantic representations of entities such as 'tetrahydrofolate reductase', 'leukemia', and 'methotrexate' and their associations with other entities in the context, enhancing the model's semantic characterization capability.
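As a hedged illustration of the entity masking described above (a sketch, not the patent's reference implementation; the PICO dictionary contents and function names here are assumptions), the snippet below masks every character of a matched entity as one unit, instead of masking isolated characters:

```python
import random

# Hypothetical PICO dictionary; in the patent's scheme these entries would
# come from entities extracted from an evidence-based medical knowledge graph.
PICO_ENTITIES = ["甲氨蝶呤", "白血病", "亚甲基四氢叶酸还原酶"]

def entity_mask(text, entities=PICO_ENTITIES, mask_rate=0.15, mask_token="[MASK]"):
    """Mask whole PICO entities as single units; non-entity text is untouched.

    Each selected entity contributes len(entity) [MASK] symbols, so the
    one-slot-per-character sequence length is preserved.
    """
    masked, targets = text, []
    for ent in entities:
        if ent in masked and random.random() < mask_rate:
            targets.append(ent)
            masked = masked.replace(ent, mask_token * len(ent))
    return masked, targets

sent = "急性淋巴细胞白血病患者甲氨蝶呤化疗药物不良反应的相关性研究"
print(entity_mask(sent, mask_rate=1.0))  # mask_rate=1.0 forces masking for the demo
```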
In addition, the BERT model also has a Next Sentence Prediction (NSP) task, which was introduced to train the model to understand the relationship between sentences. However, in the embodiment of the invention, removing the NSP task does not affect model performance, so the NSP training task can be deleted from the PICO-BERT model to simplify the model.
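In tooling terms (a sketch assuming the HuggingFace transformers library, which the patent does not prescribe), deleting the NSP task corresponds to pre-training with a masked-LM-only head instead of BERT's combined pre-training head:

```python
from transformers import BertForMaskedLM

# BertForPreTraining would carry both the MLM head and the NSP head;
# BertForMaskedLM keeps only the MLM head, matching the deletion of the
# NSP training task described in this embodiment.
model = BertForMaskedLM.from_pretrained("bert-base-chinese")
```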
Specifically, in order to obtain evidence-based medical training samples, the following steps may be performed:
step S21: acquiring medical literature;
step S22: extracting the PICO entities from the medical literature, the PICO entities including: the problem object, the intervention measure, the comparison (alternative) measure, and the outcome;
step S23: determining the PICO entities in each piece of medical literature as an evidence-based medical training sample.
That is to say, existing medical literature contains a large amount of usable data, but not in a form that the BERT model can use directly. In the embodiment of the invention, the valuable data in the medical literature are therefore extracted according to the PICO format to obtain the evidence-based medical training samples. Of course, a training sample does not have to come from medical literature; training samples in the PICO format may also be obtained by other means, such as case practice in a hospital.
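A minimal sketch of steps S21 to S23, assuming the PICO entities have already been recognized (the extractor callback, field names, and sample format are illustrative assumptions, not the patent's specification):

```python
from dataclasses import dataclass

@dataclass
class PicoSample:
    """One evidence-based medical training sample in PICO format."""
    patient: str        # P: object of the problem (patient or population)
    intervention: str   # I: intervention measure
    comparison: str     # C: alternative / comparison measure
    outcome: str        # O: diagnosis and treatment effect

def build_samples(documents, extract_pico):
    """Steps S21-S23: turn raw medical literature into PICO training samples.

    `extract_pico` is a placeholder for any entity extraction method, e.g.
    one driven by an evidence-based medical knowledge graph.
    """
    samples = []
    for doc in documents:                       # S21: acquire medical literature
        p, i, c, o = extract_pico(doc)          # S22: extract the PICO entities
        samples.append(PicoSample(p, i, c, o))  # S23: one sample per document
    return samples
```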
Further, the performing of entity-vocabulary masking on the evidence-based medical training sample to obtain an MLM training sample comprises:
step S31: performing word segmentation on the evidence-based medical training sample to obtain a word segmentation result;
step S32: aligning the word segmentation result with the PICO entities to obtain an alignment result;
step S33: masking the evidence-based medical training sample using the entity vocabulary in the alignment result to obtain an MLM training sample; wherein a PICO entity is an entity with actual natural meaning in evidence-based medicine.
That is to say, the evidence-based medical training sample can be segmented into words and aligned with the PICO entities. Specifically, a PICO dictionary can be built from the PICO entities extracted from the evidence-based medical knowledge graph; the PICO entries in this dictionary are used for alignment when segmenting the evidence-based medical sample, and the entries of the PICO entities in the dictionary are used when randomly masking, thereby obtaining the MLM training sample. Specifically, the entity vocabulary includes at least one of: disease category vocabulary, drug vocabulary, biological enzyme vocabulary, and pathological reaction vocabulary.
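A sketch of steps S31 and S32, assuming the jieba segmenter with the PICO dictionary loaded as a user dictionary (the patent does not name a segmenter; the dictionary file and its contents are assumptions):

```python
import jieba

# Hypothetical PICO dictionary file built from the evidence-based medical
# knowledge graph, one entity per line (e.g. 甲氨蝶呤); loading it as a user
# dictionary keeps each PICO entity intact as a single segmentation unit.
jieba.load_userdict("pico_dict.txt")

PICO_DICT = {"甲氨蝶呤", "白血病", "亚甲基四氢叶酸还原酶"}

def segment_and_align(sample):
    """Step S31: segment; step S32: align segments with PICO entities."""
    words = jieba.lcut(sample)
    # The alignment result marks which segments are maskable PICO entities.
    return [(w, w in PICO_DICT) for w in words]

print(segment_and_align("白血病患者甲氨蝶呤化疗药物不良反应"))
```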
On the basis of any of the above embodiments, in this embodiment, in order to perform MLM training on the BERT model by using the MLM training samples, the following steps may be specifically performed to obtain the PICO-BERT model:
step S41: placing the BERT model in a training mode;
step S42: inputting the MLM training samples into the BERT model to obtain a trained BERT model;
step S43: and determining the trained BERT model as a PICO-BERT model.
There are a number of pre-trained BERT models available, in different languages and model sizes. For Chinese, the embodiment of the invention may use the 'BERT-Base, Chinese' checkpoint, although other versions may also be selected. The BERT model is then placed in training mode, the prepared MLM training samples are input, and the PICO-BERT model is obtained once training finishes.
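A minimal sketch of steps S41 to S43, assuming the HuggingFace transformers and PyTorch libraries (the patent describes these steps abstractly and does not mandate any toolkit; the masked pair below stands in for the entity-masked MLM samples from the earlier sketches):

```python
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")
model.train()  # S41: place the BERT model in training mode

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# One illustrative (masked_text, original_text) pair; in practice these
# come from the entity-masked MLM training samples.
masked = "急性淋巴细胞[MASK][MASK][MASK]患者甲氨蝶呤化疗"
original = "急性淋巴细胞白血病患者甲氨蝶呤化疗"

inputs = tokenizer(masked, return_tensors="pt")
labels = tokenizer(original, return_tensors="pt")["input_ids"]
labels[inputs["input_ids"] != tokenizer.mask_token_id] = -100  # loss on masks only

loss = model(**inputs, labels=labels).loss  # S42: one MLM training step
loss.backward()
optimizer.step()
optimizer.zero_grad()

model.save_pretrained("pico-bert")  # S43: the trained model is PICO-BERT
```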
According to the above embodiment, evidence-based medical research knowledge is fused into the pre-training language model, and a PICO-BERT model better adapted to evidence-based medical business scenarios is trained, so that the PICO-BERT model can meet the requirements of specific scenarios in the medical field. Through fine-tuning, PICO-BERT can be deployed in a number of business scenarios and achieve better business results.
The embodiment of the invention gives full play to the advantages of domain data: domain data are added to the trained PICO-BERT model for continued training to adapt to the medical field, completing domain transfer so that the model is better suited to solving natural language problems in the medical field. The embodiment of the invention also redesigns the pre-training tasks of the BERT model: it not only simplifies the model by removing the NSP task but also changes the single-character masking in the MLM task to entity masking, so that the PICO entity information in the evidence-based medical knowledge graph is integrated into the pre-training process. This further enhances the model's overall semantic representation capability, giving the trained PICO-BERT model stronger semantic comprehension and a stronger ability to handle natural language problems in complex domain-specific scenarios.
Referring to fig. 5 and 6, fig. 5 is a schematic diagram illustrating a medical application model training apparatus based on a BERT model according to an embodiment of the present invention; fig. 6 is a schematic diagram of a sample processing module of a medical application model training apparatus based on a BERT model according to an embodiment of the present invention.
The medical application model training device based on the BERT model provided by the embodiment of the invention is described below, and the medical application model training device based on the BERT model described below and the medical application model training method based on the BERT model described above can be referred to correspondingly.
In another embodiment, the present invention provides a medical application model training apparatus 500 based on a BERT model, comprising:
a sample acquisition module 510, configured to acquire an evidence-based medical training sample;
a sample processing module 520, configured to perform entity-vocabulary masking on the evidence-based medical training sample to obtain an MLM training sample;
a model training module 530, configured to perform MLM training on the BERT model using the MLM training samples to obtain a PICO-BERT model;
wherein the entity vocabulary corresponds to entities having practical significance in evidence-based medicine.
Further, the sample processing module 520 includes:
a word segmentation unit 521, configured to segment the evidence-based medical training sample into words to obtain a word segmentation result;
an alignment unit 522, configured to align the word segmentation result with the PICO entities to obtain an alignment result;
a masking unit 523, configured to mask the evidence-based medical training sample using the entity vocabulary in the alignment result to obtain an MLM training sample;
wherein a PICO entity is an entity with actual natural meaning in evidence-based medicine.
According to the medical application model training method and device based on the BERT model provided by the embodiments of the invention, MLM training is performed on the BERT model using evidence-based medical training samples, and during training, entity-vocabulary masking is performed on each evidence-based medical training sample to obtain an MLM training sample. Performing MLM training with these entity-masked samples yields the PICO-BERT model, which enhances the model's overall semantic representation capability, gives the trained PICO-BERT model stronger semantic comprehension and a stronger ability to handle natural language problems in complex domain-specific scenarios, and allows the method to be better applied in the medical field, improving the comprehension of natural language in specific research scenarios in the medical field.
Fig. 7 illustrates a schematic diagram of the physical structure of an electronic device. As shown in fig. 7, the electronic device may include: a processor 710, a communication interface 720, a memory 730, and a communication bus 740, wherein the processor 710, the communication interface 720, and the memory 730 communicate with each other via the communication bus 740. The processor 710 may invoke logic instructions in the memory 730 to perform the BERT model-based medical application model training method, comprising: acquiring an evidence-based medical training sample; performing entity-vocabulary masking on the evidence-based medical training sample to obtain an MLM training sample; and performing MLM training on the BERT model using the MLM training samples to obtain a PICO-BERT model; wherein the entity vocabulary corresponds to entities having practical significance in evidence-based medicine.
In addition, the logic instructions in the memory 730 can be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the BERT model-based medical application model training method provided in the foregoing embodiments, comprising: acquiring an evidence-based medical training sample; performing entity-vocabulary masking on the evidence-based medical training sample to obtain an MLM training sample; and performing MLM training on the BERT model using the MLM training samples to obtain a PICO-BERT model; wherein the entity vocabulary corresponds to entities having practical significance in evidence-based medicine.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A medical application model training method based on a BERT model is characterized by comprising the following steps:
acquiring an evidence-based medical training sample;
performing entity-vocabulary masking on the evidence-based medical training sample to obtain an MLM training sample;
performing MLM training on the BERT model using the MLM training samples to obtain a PICO-BERT model;
wherein the entity vocabulary corresponds to entities having practical significance in evidence-based medicine.
2. The BERT model-based medical application model training method of claim 1,
the acquiring of the evidence-based medical training sample includes:
acquiring medical literature;
extracting the PICO entities from the medical literature, the PICO entities including: the problem object, the intervention measure, the comparison (alternative) measure, and the outcome;
determining the PICO entities in each piece of medical literature as an evidence-based medical training sample.
3. The BERT model-based medical application model training method of claim 1,
the step of performing entity-vocabulary masking on the evidence-based medical training sample to obtain the MLM training sample comprises the following steps:
performing word segmentation on the evidence-based medical training sample to obtain a word segmentation result;
aligning the word segmentation result with the PICO entities to obtain an alignment result;
masking the evidence-based medical training sample using the entity vocabulary in the alignment result to obtain an MLM training sample;
wherein a PICO entity is an entity with actual natural meaning in evidence-based medicine.
4. The BERT model-based medical application model training method of claim 1,
the entity vocabulary includes at least one of: disease category vocabulary, drug vocabulary, biological enzyme vocabulary, and pathological reaction vocabulary.
5. The BERT model-based medical application model training method of claim 1,
the performing MLM training on the BERT model using the MLM training samples to obtain the PICO-BERT model comprises the following steps:
placing the BERT model in a training mode;
inputting the MLM training samples into the BERT model to obtain a trained BERT model;
and determining the trained BERT model as a PICO-BERT model.
6. The BERT model-based medical application model training method of any one of claims 1 to 5,
deleting the NSP training task in the PICO-BERT model.
7. A medical application model training device based on a BERT model is characterized by comprising:
a sample acquisition module, configured to acquire an evidence-based medical training sample;
a sample processing module, configured to perform entity-vocabulary masking on the evidence-based medical training sample to obtain an MLM training sample;
a model training module, configured to perform MLM training on the BERT model using the MLM training samples to obtain a PICO-BERT model;
wherein the entity vocabulary corresponds to entities having practical significance in evidence-based medicine.
8. The BERT model-based medical application model training apparatus of claim 7, wherein the sample processing module comprises:
a word segmentation unit, configured to segment the evidence-based medical training sample into words to obtain a word segmentation result;
an alignment unit, configured to align the word segmentation result with the PICO entities to obtain an alignment result;
a masking unit, configured to mask the evidence-based medical training sample using the entity vocabulary in the alignment result to obtain an MLM training sample;
wherein a PICO entity is an entity with actual natural meaning in evidence-based medicine.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, carries out the steps of the method for training a BERT model-based medical application model according to any of claims 1 to 6.
10. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method for training a medical application model based on a BERT model according to any one of claims 1 to 6.
CN202011159163.6A 2020-10-26 2020-10-26 Medical application model training method and device based on BERT model Pending CN112347773A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011159163.6A CN112347773A (en) 2020-10-26 2020-10-26 Medical application model training method and device based on BERT model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011159163.6A CN112347773A (en) 2020-10-26 2020-10-26 Medical application model training method and device based on BERT model

Publications (1)

Publication Number Publication Date
CN112347773A true CN112347773A (en) 2021-02-09

Family

ID=74359013

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011159163.6A Pending CN112347773A (en) 2020-10-26 2020-10-26 Medical application model training method and device based on BERT model

Country Status (1)

Country Link
CN (1) CN112347773A (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110427486A (en) * 2019-07-25 2019-11-08 北京百度网讯科技有限公司 Classification method, device and the equipment of body patient's condition text
CN111126068A (en) * 2019-12-25 2020-05-08 中电云脑(天津)科技有限公司 Chinese named entity recognition method and device and electronic equipment
CN111401066A (en) * 2020-03-12 2020-07-10 腾讯科技(深圳)有限公司 Artificial intelligence-based word classification model training method, word processing method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG TIANGANG et al., "Automatic ICD coding based on a pre-trained representation model" (基于预训练表征模型的自动ICD编码), China Digital Medicine, no. 07, 15 July 2020 (2020-07-15), pages 54-56 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112800770A (en) * 2021-04-15 2021-05-14 南京樯图数据研究院有限公司 Entity alignment method based on heteromorphic graph attention network
CN112800770B (en) * 2021-04-15 2021-07-09 南京樯图数据研究院有限公司 Entity alignment method based on heteromorphic graph attention network
CN113836919A (en) * 2021-09-30 2021-12-24 中国建筑第七工程局有限公司 Building industry text error correction method based on transfer learning

Similar Documents

Publication Publication Date Title
CN112242187B (en) Medical scheme recommendation system and method based on knowledge graph characterization learning
CN110750959B (en) Text information processing method, model training method and related device
CN108628824A (en) A kind of entity recognition method based on Chinese electronic health record
Bień et al. RecipeNLG: A cooking recipes dataset for semi-structured text generation
EP3567605A1 (en) Structured report data from a medical text report
CN112597774B (en) Chinese medical named entity recognition method, system, storage medium and equipment
Shardlow Out in the Open: Finding and Categorising Errors in the Lexical Simplification Pipeline.
Carchiolo et al. Medical prescription classification: a NLP-based approach
CN106844351B (en) Medical institution organization entity identification method and device oriented to multiple data sources
JP2002515148A (en) System and method for medical language extraction and encoding
CN110427486B (en) Body condition text classification method, device and equipment
CN112347773A (en) Medical application model training method and device based on BERT model
CN112017744A (en) Electronic case automatic generation method, device, equipment and storage medium
Varvara et al. Grounding semantic transparency in context: A distributional semantic study on German event nominalizations
CN109299467A (en) Medicine text recognition method and device, sentence identification model training method and device
CN107122582B (en) diagnosis and treatment entity identification method and device facing multiple data sources
CN116911300A (en) Language model pre-training method, entity recognition method and device
CN112784601A (en) Key information extraction method and device, electronic equipment and storage medium
Rocha et al. A speech-to-text interface for mammoclass
CN115757815A (en) Knowledge graph construction method and device and storage medium
CN114898895A (en) Xinjiang local adverse drug reaction identification method and related device
CN114743621A (en) Medical record input prediction method, medical record input prediction device, and storage medium
CN113836892A (en) Sample size data extraction method and device, electronic equipment and storage medium
CN111091915A (en) Medical data processing method and device, storage medium and electronic equipment
Luong et al. Building a corpus for vietnamese text readability assessment in the literature domain

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination