CN112347773A - Medical application model training method and device based on BERT model - Google Patents

Medical application model training method and device based on BERT model

Info

Publication number
CN112347773A
CN112347773A (application CN202011159163.6A)
Authority
CN
China
Prior art keywords
training
model
evidence
bert model
pico
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011159163.6A
Other languages
Chinese (zh)
Inventor
刘静
周永杰
王则远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Nuodao Cognitive Medical Technology Co ltd
Original Assignee
Beijing Nuodao Cognitive Medical Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Nuodao Cognitive Medical Technology Co ltd filed Critical Beijing Nuodao Cognitive Medical Technology Co ltd
Priority to CN202011159163.6A priority Critical patent/CN112347773A/en
Publication of CN112347773A publication Critical patent/CN112347773A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention provides a medical application model training method and device based on a BERT model. The method comprises: acquiring an evidence-based medical training sample; performing entity-vocabulary masking on the evidence-based medical training sample to obtain an MLM training sample; and performing MLM training on the BERT model using the MLM training samples to obtain a PICO-BERT model, wherein the entity vocabulary corresponds to entities having practical significance in evidence-based medicine. Performing MLM training with samples in which whole entity words are masked enhances the model's overall semantic representation capability, so the trained PICO-BERT model has stronger semantic comprehension and a stronger ability to handle natural language problems in complex domain-specific scenarios. The method can therefore be better applied in the medical field and improves the comprehension of natural language in specific research scenarios in the medical field.

Description

Medical application model training method and device based on BERT model
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a medical application model training method and device based on a BERT model.
Background
In the field of natural language processing, pre-training language models have created a new research paradigm and refreshed the state of the art on multiple natural language processing tasks. In this paradigm, a language model is first pre-trained on a large amount of unsupervised corpus and then fine-tuned with a small amount of labeled corpus to complete downstream NLP tasks such as text classification, sequence labeling, machine translation, and reading comprehension.
The pre-training language model BERT introduces two pre-training tasks, MLM (Masked Language Model) and NSP (Next Sentence Prediction), performs pre-training on a larger-scale corpus, and achieved the best results on 11 natural language understanding tasks. To ensure the universality of the BERT model, the large-scale corpus on which BERT is based covers all knowledge fields. A pre-training language model trained on such a corpus can be used to solve natural language problems in different fields, but it generally delivers only average performance in some professional fields and cannot adaptively solve the natural language problems of those fields.
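As a minimal illustration of the MLM task described above (an explanatory sketch, not part of the patent text; the function name is illustrative, the 15% rate follows the original BERT paper, and BERT's 80/10/10 replacement detail is omitted for brevity):

```python
import random

def random_token_mask(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Randomly replace ~15% of tokens with [MASK], as in BERT's MLM task."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            targets[i] = tok      # the model is trained to recover this token
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets

# Original BERT treats each Chinese character as one token, so each
# masked unit here is a single character, not a whole word or entity.
tokens = list("急性淋巴细胞白血病患者")
print(random_token_mask(tokens))
```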
Although current pre-training language models perform well in the general domain, they cannot properly solve natural language processing problems in professional domains, because the large-scale corpora they are based on are not specific to any particular field. In the medical field this defect is especially serious: the medical domain is highly specialized and the fault tolerance allowed of deep learning models used in it is low, so commonly used pre-training language models such as BERT have poor applicability in the medical field and cannot solve the natural language problems of certain specific research scenarios in the medical field.
Therefore, how to provide a model scheme that can be better applied in the medical field and improve the comprehension of natural language in specific research scenarios in the medical field is a technical problem to be urgently solved by those skilled in the art.
Disclosure of Invention
The embodiments of the invention provide a method and a device for training a medical application model based on a BERT model, which can be better applied to the medical field and improve the comprehension of natural language in specific research scenarios in the medical field.
The embodiment of the invention provides a medical application model training method based on a BERT model, which comprises the following steps:
acquiring an evidence-based medical training sample;
performing entity-vocabulary masking on the evidence-based medical training sample to obtain an MLM training sample;
performing MLM training on the BERT model using the MLM training samples to obtain a PICO-BERT model;
wherein the entity vocabulary corresponds to entities having practical significance in evidence-based medicine.
Further, the acquiring of the evidence-based medical training sample comprises:
acquiring medical literature;
extracting the PICO entities from the medical literature, the PICO entities including: the problem object, the intervention measure, the comparison (alternative) measure, and the outcome;
determining the PICO entities in each piece of medical literature as an evidence-based medical training sample.
Further, the performing entity-vocabulary masking on the evidence-based medical training sample to obtain an MLM training sample includes:
performing word segmentation on the evidence-based medical training sample to obtain a word segmentation result;
aligning the word segmentation result with the PICO entities to obtain an alignment result;
masking the evidence-based medical training sample using the entity vocabulary in the alignment result to obtain an MLM training sample;
wherein a PICO entity is an entity with actual natural meaning in evidence-based medicine.
Further, the entity vocabulary includes at least one of: disease category vocabulary, drug vocabulary, biological enzyme vocabulary, and pathological reaction vocabulary.
Further, the performing MLM training on the BERT model using the MLM training samples to obtain the PICO-BERT model includes:
placing the BERT model in a training mode;
inputting the MLM training samples into the BERT model to obtain a trained BERT model;
and determining the trained BERT model as a PICO-BERT model.
Further, the NSP training task in the PICO-BERT model is deleted.
In a second aspect, an embodiment of the present invention provides a medical application model training apparatus based on a BERT model, including:
a sample acquisition module, configured to acquire an evidence-based medical training sample;
a sample processing module, configured to perform entity-vocabulary masking on the evidence-based medical training sample to obtain an MLM training sample;
a model training module, configured to perform MLM training on the BERT model using the MLM training samples to obtain a PICO-BERT model;
wherein the entity vocabulary corresponds to entities having practical significance in evidence-based medicine.
Further, the sample processing module comprises:
a word segmentation unit, configured to segment the evidence-based medical training sample into words to obtain a word segmentation result;
an alignment unit, configured to align the word segmentation result with the PICO entities to obtain an alignment result;
a masking unit, configured to mask the evidence-based medical training sample using the entity vocabulary in the alignment result to obtain an MLM training sample;
wherein a PICO entity is an entity with actual natural meaning in evidence-based medicine.
Embodiments of the present invention further provide an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of any of the BERT model-based medical application model training methods described above.
Embodiments of the present invention further provide a non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor, implements the steps of the BERT model-based medical application model training method as described in any of the above.
According to the medical application model training method and device based on the BERT model provided by the embodiments of the invention, MLM training is performed on the BERT model using evidence-based medical training samples, and during training, entity-vocabulary masking is performed on each evidence-based medical training sample to obtain an MLM training sample. Performing MLM training with these entity-masked samples yields the PICO-BERT model, which enhances the model's overall semantic representation capability, gives the trained PICO-BERT model stronger semantic comprehension and a stronger ability to handle natural language problems in complex domain-specific scenarios, and allows the method to be better applied in the medical field, improving the comprehension of natural language in specific research scenarios in the medical field.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a flowchart of a medical application model training method based on a BERT model according to an embodiment of the present invention;
FIG. 2 is a sample acquisition flowchart of a method for training a BERT-model-based medical application model according to an embodiment of the present invention;
FIG. 3 is a sample processing flow chart of a method for training a BERT model-based medical application model according to an embodiment of the present invention;
FIG. 4 is a model training flowchart of a medical application model training method based on a BERT model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating a BERT model-based medical application model training apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a sample processing module of a BERT model-based medical application model training apparatus according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A method for training a BERT model-based medical application model according to an embodiment of the present invention is described below with reference to fig. 1 to 4.
Fig. 1 is a flowchart of a medical application model training method based on a BERT model according to an embodiment of the present invention; FIG. 2 is a sample acquisition flowchart of a method for training a BERT-model-based medical application model according to an embodiment of the present invention; FIG. 3 is a sample processing flow chart of a method for training a BERT model-based medical application model according to an embodiment of the present invention; fig. 4 is a model training flowchart of a medical application model training method based on a BERT model according to an embodiment of the present invention.
In a specific embodiment, the present invention provides a medical application model training method based on a BERT model, comprising:
step S11: acquiring a evidence-based medical training sample;
in the embodiment of the invention, firstly, a evidence-based medicine training sample needs to be obtained, wherein the evidence-based medicine is to closely combine the three essential evidence-based medicines, namely the best clinical evidence, the skilled clinical experience and the specific situation of a patient, to find and collect the best clinical evidence, and aims to obtain a more sensitive and reliable diagnosis method, a more effective and safer treatment scheme and strive to enable the patient to obtain the best treatment result. Therefore, the research of syndrome-oriented medicine covers the whole process of diagnosis and treatment, and can fully embody the actual value of medical knowledge. PICO is a search method for formatting information based on evidence-based medicine (EBM) theory. Abbreviations for partitionants, intersections, compositions, utcoms. P is the subject of the problem (patient or population), I is the intervention (e.g. diagnostic treatment), C is the other candidate (compare), and O is the outcome (outome, the diagnosis and treatment effect of the intervention). In every medical literature, there is a PICO.
Step S12: performing entity-vocabulary masking on the evidence-based medical training sample to obtain an MLM training sample;
the embodiment of the invention also provides a syndrome-oriented medical training sample, which is specially processed, and particularly, a knowledge graph in the syndrome-oriented medical field can be built based on syndrome-oriented medical knowledge, and PICO entities contained in the syndrome-oriented medical knowledge graph are extracted by an entity extraction method, wherein the entities are medical entities with syndrome-oriented medical significance.
The existing BERT pre-training process comprises two different pre-training tasks, namely the Masked Language Model task and the Next Sentence Prediction task. The Masked Language Model (MLM) trains a bidirectional language model by randomly masking some tokens and then predicting the masked tokens, so that the representation of each token draws on context information. However, when the original BERT masks, it masks one sub-word for English and one character for Chinese; this single-character modeling strategy only learns the co-occurrence relations among the characters inside an entity and does not learn an overall semantic representation of the entity.
Step S13: performing MLM training on the BERT model using the MLM training samples to obtain a PICO-BERT model; wherein the entity vocabulary corresponds to entities having practical significance in evidence-based medicine.
In addition, in the Masked Language Model (MLM) task of the embodiment, random masking selects the whole vocabulary corresponding to an entity rather than a single masked character. This requires the corpus to be segmented into words before pre-training and the segmentation results to be aligned with the PICO entities extracted from the evidence-based medical knowledge graph. For example, given 'a correlation study between the gene polymorphism of methylenetetrahydrofolate reductase (I) and the adverse reactions to the chemotherapy drug methotrexate (P) in patients with acute lymphoblastic leukemia (P)', PICO-BERT needs to accurately predict 'methylenetetrahydrofolate reductase', 'leukemia', and 'methotrexate' from context information such as 'gene polymorphism', 'chemotherapy drug', and 'adverse reaction'. Entity knowledge is merged into the pre-training process of the PICO-BERT model through this entity masking. PICO-BERT can thereby learn the semantic representations of entities such as 'tetrahydrofolate reductase', 'leukemia', and 'methotrexate' and their associations with other entities in the context, enhancing the model's semantic characterization capability.
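As a hedged illustration of the entity masking described above (a sketch, not the patent's reference implementation; the PICO dictionary contents and function names here are assumptions), the snippet below masks every character of a matched entity as one unit, instead of masking isolated characters:

```python
import random

# Hypothetical PICO dictionary; in the patent's scheme these entries would
# come from entities extracted from an evidence-based medical knowledge graph.
PICO_ENTITIES = ["甲氨蝶呤", "白血病", "亚甲基四氢叶酸还原酶"]

def entity_mask(text, entities=PICO_ENTITIES, mask_rate=0.15, mask_token="[MASK]"):
    """Mask whole PICO entities as single units; non-entity text is untouched.

    Each selected entity contributes len(entity) [MASK] symbols, so the
    one-slot-per-character sequence length is preserved.
    """
    masked, targets = text, []
    for ent in entities:
        if ent in masked and random.random() < mask_rate:
            targets.append(ent)
            masked = masked.replace(ent, mask_token * len(ent))
    return masked, targets

sent = "急性淋巴细胞白血病患者甲氨蝶呤化疗药物不良反应的相关性研究"
print(entity_mask(sent, mask_rate=1.0))  # mask_rate=1.0 forces masking for the demo
```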
In addition, the BERT model also has a Next Sentence Prediction (NSP) task, which was introduced to train the model to understand the relationship between sentences. However, in the embodiment of the invention, removing the NSP task does not affect model performance, so the NSP training task can be deleted from the PICO-BERT model to simplify the model.
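In tooling terms (a sketch assuming the HuggingFace transformers library, which the patent does not prescribe), deleting the NSP task corresponds to pre-training with a masked-LM-only head instead of BERT's combined pre-training head:

```python
from transformers import BertForMaskedLM

# BertForPreTraining would carry both the MLM head and the NSP head;
# BertForMaskedLM keeps only the MLM head, matching the deletion of the
# NSP training task described in this embodiment.
model = BertForMaskedLM.from_pretrained("bert-base-chinese")
```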
Specifically, in order to obtain evidence-based medical training samples, the following steps may be performed:
step S21: acquiring medical literature;
step S22: extracting the PICO entities from the medical literature, the PICO entities including: the problem object, the intervention measure, the comparison (alternative) measure, and the outcome;
step S23: determining the PICO entities in each piece of medical literature as an evidence-based medical training sample.
That is to say, existing medical literature contains a large amount of usable data, but not in a form that the BERT model can use directly. In the embodiment of the invention, the valuable data in the medical literature are therefore extracted according to the PICO format to obtain the evidence-based medical training samples. Of course, a training sample does not have to come from medical literature; training samples in the PICO format may also be obtained by other means, such as case practice in a hospital.
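A minimal sketch of steps S21 to S23, assuming the PICO entities have already been recognized (the extractor callback, field names, and sample format are illustrative assumptions, not the patent's specification):

```python
from dataclasses import dataclass

@dataclass
class PicoSample:
    """One evidence-based medical training sample in PICO format."""
    patient: str        # P: object of the problem (patient or population)
    intervention: str   # I: intervention measure
    comparison: str     # C: alternative / comparison measure
    outcome: str        # O: diagnosis and treatment effect

def build_samples(documents, extract_pico):
    """Steps S21-S23: turn raw medical literature into PICO training samples.

    `extract_pico` is a placeholder for any entity extraction method, e.g.
    one driven by an evidence-based medical knowledge graph.
    """
    samples = []
    for doc in documents:                       # S21: acquire medical literature
        p, i, c, o = extract_pico(doc)          # S22: extract the PICO entities
        samples.append(PicoSample(p, i, c, o))  # S23: one sample per document
    return samples
```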
Further, the performing of entity-vocabulary masking on the evidence-based medical training sample to obtain an MLM training sample comprises:
step S31: performing word segmentation on the evidence-based medical training sample to obtain a word segmentation result;
step S32: aligning the word segmentation result with the PICO entities to obtain an alignment result;
step S33: masking the evidence-based medical training sample using the entity vocabulary in the alignment result to obtain an MLM training sample; wherein a PICO entity is an entity with actual natural meaning in evidence-based medicine.
That is to say, the evidence-based medical training sample can be segmented into words and aligned with the PICO entities. Specifically, a PICO dictionary can be built from the PICO entities extracted from the evidence-based medical knowledge graph; the PICO entries in this dictionary are used for alignment when segmenting the evidence-based medical sample, and the entries of the PICO entities in the dictionary are used when randomly masking, thereby obtaining the MLM training sample. Specifically, the entity vocabulary includes at least one of: disease category vocabulary, drug vocabulary, biological enzyme vocabulary, and pathological reaction vocabulary.
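A sketch of steps S31 and S32, assuming the jieba segmenter with the PICO dictionary loaded as a user dictionary (the patent does not name a segmenter; the dictionary file and its contents are assumptions):

```python
import jieba

# Hypothetical PICO dictionary file built from the evidence-based medical
# knowledge graph, one entity per line (e.g. 甲氨蝶呤); loading it as a user
# dictionary keeps each PICO entity intact as a single segmentation unit.
jieba.load_userdict("pico_dict.txt")

PICO_DICT = {"甲氨蝶呤", "白血病", "亚甲基四氢叶酸还原酶"}

def segment_and_align(sample):
    """Step S31: segment; step S32: align segments with PICO entities."""
    words = jieba.lcut(sample)
    # The alignment result marks which segments are maskable PICO entities.
    return [(w, w in PICO_DICT) for w in words]

print(segment_and_align("白血病患者甲氨蝶呤化疗药物不良反应"))
```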
On the basis of any of the above embodiments, in this embodiment, in order to perform MLM training on the BERT model by using the MLM training samples, the following steps may be specifically performed to obtain the PICO-BERT model:
step S41: placing the BERT model in a training mode;
step S42: inputting the MLM training samples into the BERT model to obtain a trained BERT model;
step S43: and determining the trained BERT model as a PICO-BERT model.
There are a number of pre-trained BERT models available, in different languages and model sizes. For Chinese, the embodiment of the invention may use the 'BERT-Base, Chinese' checkpoint, although other versions may also be selected. The BERT model is then placed in training mode, the prepared MLM training samples are input, and the PICO-BERT model is obtained once training finishes.
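A minimal sketch of steps S41 to S43, assuming the HuggingFace transformers and PyTorch libraries (the patent describes these steps abstractly and does not mandate any toolkit; the masked pair below stands in for the entity-masked MLM samples from the earlier sketches):

```python
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")
model.train()  # S41: place the BERT model in training mode

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# One illustrative (masked_text, original_text) pair; in practice these
# come from the entity-masked MLM training samples.
masked = "急性淋巴细胞[MASK][MASK][MASK]患者甲氨蝶呤化疗"
original = "急性淋巴细胞白血病患者甲氨蝶呤化疗"

inputs = tokenizer(masked, return_tensors="pt")
labels = tokenizer(original, return_tensors="pt")["input_ids"]
labels[inputs["input_ids"] != tokenizer.mask_token_id] = -100  # loss on masks only

loss = model(**inputs, labels=labels).loss  # S42: one MLM training step
loss.backward()
optimizer.step()
optimizer.zero_grad()

model.save_pretrained("pico-bert")  # S43: the trained model is PICO-BERT
```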
According to the above embodiment, evidence-based medical research knowledge is fused into the pre-training language model, and a PICO-BERT model better adapted to evidence-based medical business scenarios is trained, so that the PICO-BERT model can meet the requirements of specific scenarios in the medical field. Through fine-tuning, PICO-BERT can be deployed in a number of business scenarios and achieve better business results.
The embodiment of the invention gives full play to the advantages of domain data: domain data are added to the trained PICO-BERT model for continued training to adapt to the medical field, completing domain transfer so that the model is better suited to solving natural language problems in the medical field. The embodiment of the invention also redesigns the pre-training tasks of the BERT model: it not only simplifies the model by removing the NSP task but also changes the single-character masking in the MLM task to entity masking, so that the PICO entity information in the evidence-based medical knowledge graph is integrated into the pre-training process. This further enhances the model's overall semantic representation capability, giving the trained PICO-BERT model stronger semantic comprehension and a stronger ability to handle natural language problems in complex domain-specific scenarios.
Referring to fig. 5 and 6, fig. 5 is a schematic diagram illustrating a medical application model training apparatus based on a BERT model according to an embodiment of the present invention; fig. 6 is a schematic diagram of a sample processing module of a medical application model training apparatus based on a BERT model according to an embodiment of the present invention.
The medical application model training device based on the BERT model provided by the embodiment of the invention is described below, and the medical application model training device based on the BERT model described below and the medical application model training method based on the BERT model described above can be referred to correspondingly.
In another embodiment, the present invention provides a medical application model training apparatus 500 based on a BERT model, comprising:
a sample acquisition module 510, configured to acquire an evidence-based medical training sample;
a sample processing module 520, configured to perform entity-vocabulary masking on the evidence-based medical training sample to obtain an MLM training sample;
a model training module 530, configured to perform MLM training on the BERT model using the MLM training samples to obtain a PICO-BERT model;
wherein the entity vocabulary corresponds to entities having practical significance in evidence-based medicine.
Further, the sample processing module 520 includes:
a word segmentation unit 521, configured to segment the evidence-based medical training sample into words to obtain a word segmentation result;
an alignment unit 522, configured to align the word segmentation result with the PICO entities to obtain an alignment result;
a masking unit 523, configured to mask the evidence-based medical training sample using the entity vocabulary in the alignment result to obtain an MLM training sample;
wherein a PICO entity is an entity with actual natural meaning in evidence-based medicine.
According to the medical application model training method and device based on the BERT model provided by the embodiments of the invention, MLM training is performed on the BERT model using evidence-based medical training samples, and during training, entity-vocabulary masking is performed on each evidence-based medical training sample to obtain an MLM training sample. Performing MLM training with these entity-masked samples yields the PICO-BERT model, which enhances the model's overall semantic representation capability, gives the trained PICO-BERT model stronger semantic comprehension and a stronger ability to handle natural language problems in complex domain-specific scenarios, and allows the method to be better applied in the medical field, improving the comprehension of natural language in specific research scenarios in the medical field.
Fig. 7 illustrates a schematic diagram of the physical structure of an electronic device. As shown in fig. 7, the electronic device may include: a processor 710, a communication interface 720, a memory 730, and a communication bus 740, wherein the processor 710, the communication interface 720, and the memory 730 communicate with each other via the communication bus 740. The processor 710 may invoke logic instructions in the memory 730 to perform the BERT model-based medical application model training method, comprising: acquiring an evidence-based medical training sample; performing entity-vocabulary masking on the evidence-based medical training sample to obtain an MLM training sample; and performing MLM training on the BERT model using the MLM training samples to obtain a PICO-BERT model; wherein the entity vocabulary corresponds to entities having practical significance in evidence-based medicine.
In addition, the logic instructions in the memory 730 can be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the BERT model-based medical application model training method provided in the foregoing embodiments, comprising: acquiring an evidence-based medical training sample; performing entity-vocabulary masking on the evidence-based medical training sample to obtain an MLM training sample; and performing MLM training on the BERT model using the MLM training samples to obtain a PICO-BERT model; wherein the entity vocabulary corresponds to entities having practical significance in evidence-based medicine.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A medical application model training method based on a BERT model is characterized by comprising the following steps:
acquiring an evidence-based medical training sample;
performing entity-vocabulary masking on the evidence-based medical training sample to obtain an MLM training sample;
performing MLM training on the BERT model using the MLM training samples to obtain a PICO-BERT model;
wherein the entity vocabulary corresponds to entities having practical significance in evidence-based medicine.
2. The BERT model-based medical application model training method of claim 1,
the acquiring of the evidence-based medical training sample includes:
acquiring medical literature;
extracting the PICO entities from the medical literature, the PICO entities including: the problem object, the intervention measure, the comparison (alternative) measure, and the outcome;
determining the PICO entities in each piece of medical literature as an evidence-based medical training sample.
3. The BERT model-based medical application model training method of claim 1,
the step of performing entity-vocabulary masking on the evidence-based medical training sample to obtain the MLM training sample comprises the following steps:
performing word segmentation on the evidence-based medical training sample to obtain a word segmentation result;
aligning the word segmentation result with the PICO entities to obtain an alignment result;
masking the evidence-based medical training sample using the entity vocabulary in the alignment result to obtain an MLM training sample;
wherein a PICO entity is an entity with actual natural meaning in evidence-based medicine.
4. The BERT model-based medical application model training method of claim 1,
the entity vocabulary includes at least one of: disease category vocabulary, drug vocabulary, biological enzyme vocabulary, and pathological reaction vocabulary.
5. The BERT model-based medical application model training method of claim 1,
the performing MLM training on the BERT model using the MLM training samples to obtain the PICO-BERT model comprises the following steps:
placing the BERT model in a training mode;
inputting the MLM training samples into the BERT model to obtain a trained BERT model;
and determining the trained BERT model as a PICO-BERT model.
6. The BERT model-based medical application model training method of any one of claims 1 to 5,
deleting the NSP training task in the PICO-BERT model.
7. A medical application model training device based on a BERT model is characterized by comprising:
a sample acquisition module, configured to acquire an evidence-based medical training sample;
a sample processing module, configured to perform entity-vocabulary masking on the evidence-based medical training sample to obtain an MLM training sample;
a model training module, configured to perform MLM training on the BERT model using the MLM training samples to obtain a PICO-BERT model;
wherein the entity vocabulary corresponds to entities having practical significance in evidence-based medicine.
8. The BERT model-based medical application model training apparatus of claim 7, wherein the sample processing module comprises:
a word segmentation unit, configured to segment the evidence-based medical training sample into words to obtain a word segmentation result;
an alignment unit, configured to align the word segmentation result with the PICO entities to obtain an alignment result;
a masking unit, configured to mask the evidence-based medical training sample using the entity vocabulary in the alignment result to obtain an MLM training sample;
wherein a PICO entity is an entity with actual natural meaning in evidence-based medicine.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, carries out the steps of the method for training a BERT model-based medical application model according to any of claims 1 to 6.
10. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method for training a medical application model based on a BERT model according to any one of claims 1 to 6.
CN202011159163.6A 2020-10-26 2020-10-26 Medical application model training method and device based on BERT model Pending CN112347773A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011159163.6A CN112347773A (en) 2020-10-26 2020-10-26 Medical application model training method and device based on BERT model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011159163.6A CN112347773A (en) 2020-10-26 2020-10-26 Medical application model training method and device based on BERT model

Publications (1)

Publication Number Publication Date
CN112347773A true CN112347773A (en) 2021-02-09

Family

ID=74359013

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011159163.6A Pending CN112347773A (en) 2020-10-26 2020-10-26 Medical application model training method and device based on BERT model

Country Status (1)

Country Link
CN (1) CN112347773A (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110427486A (en) * 2019-07-25 2019-11-08 北京百度网讯科技有限公司 Classification method, device and the equipment of body patient's condition text
CN111126068A (en) * 2019-12-25 2020-05-08 中电云脑(天津)科技有限公司 Chinese named entity recognition method and device and electronic equipment
CN111401066A (en) * 2020-03-12 2020-07-10 腾讯科技(深圳)有限公司 Artificial intelligence-based word classification model training method, word processing method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG TIANGANG et al., "Automatic ICD coding based on a pre-trained representation model" (基于预训练表征模型的自动ICD编码), China Digital Medicine, no. 07, 15 July 2020 (2020-07-15), pages 54-56 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112800770A (en) * 2021-04-15 2021-05-14 南京樯图数据研究院有限公司 Entity alignment method based on heteromorphic graph attention network
CN112800770B (en) * 2021-04-15 2021-07-09 南京樯图数据研究院有限公司 Entity alignment method based on heteromorphic graph attention network
CN113836919A (en) * 2021-09-30 2021-12-24 中国建筑第七工程局有限公司 Building industry text error correction method based on transfer learning

Similar Documents

Publication Publication Date Title
CN112242187B (en) Medical scheme recommendation system and method based on knowledge graph characterization learning
CN110750959B (en) Text information processing method, model training method and related device
CN108628824A (en) A kind of entity recognition method based on Chinese electronic health record
Bień et al. RecipeNLG: A cooking recipes dataset for semi-structured text generation
EP3567605A1 (en) Structured report data from a medical text report
CN112597774B (en) Chinese medical named entity recognition method, system, storage medium and equipment
Shardlow Out in the Open: Finding and Categorising Errors in the Lexical Simplification Pipeline.
Carchiolo et al. Medical prescription classification: a NLP-based approach
CN106844351B (en) Medical institution organization entity identification method and device oriented to multiple data sources
JP2002515148A (en) System and method for medical language extraction and encoding
CN110427486B (en) Body condition text classification method, device and equipment
CN112347773A (en) Medical application model training method and device based on BERT model
CN112017744A (en) Electronic case automatic generation method, device, equipment and storage medium
Varvara et al. Grounding semantic transparency in context: A distributional semantic study on German event nominalizations
CN109299467A (en) Medicine text recognition method and device, sentence identification model training method and device
CN107122582B (en) diagnosis and treatment entity identification method and device facing multiple data sources
CN116911300A (en) Language model pre-training method, entity recognition method and device
CN112784601A (en) Key information extraction method and device, electronic equipment and storage medium
Rocha et al. A speech-to-text interface for mammoclass
CN115757815A (en) Knowledge graph construction method and device and storage medium
CN114898895A (en) Xinjiang local adverse drug reaction identification method and related device
CN114743621A (en) Medical record input prediction method, medical record input prediction device, and storage medium
CN113836892A (en) Sample size data extraction method and device, electronic equipment and storage medium
CN111091915A (en) Medical data processing method and device, storage medium and electronic equipment
Luong et al. Building a corpus for vietnamese text readability assessment in the literature domain

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination