CN112016314A - Medical text understanding method and system based on BERT model - Google Patents

Medical text understanding method and system based on BERT model

Info

Publication number
CN112016314A
CN112016314A (application CN202010977191.2A)
Authority
CN
China
Prior art keywords
medical text
medical
word
text
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010977191.2A
Other languages
Chinese (zh)
Inventor
汪秀英 (Wang Xiuying)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN202010977191.2A
Publication of CN112016314A
Legal status: Withdrawn

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33: Querying
    • G06F 16/335: Filtering based on additional data, e.g. user or group profiles
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H 50/70: ICT specially adapted for medical diagnosis, medical simulation or medical data mining, for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention relates to the technical field of text processing and discloses a medical text understanding method based on a BERT model, which comprises the following steps: acquiring medical text data and filtering out invalid medical text data with a sentence filtering model; generating large-scale medical text data from the filtered data with a text-copy-based medical text generation model; training a medical text entity recognition model on the generated large-scale medical-field text data; performing entity recognition on the medical text to be processed with the trained medical text entity recognition model; extracting the semantics of the medical text entities with an attention-based information extraction method to obtain their semantic features; and, according to the semantic features of the medical text entities, understanding the medical text with a multilayer perceptron. The invention also provides a BERT model-based medical text understanding system, thereby realizing the understanding of medical texts.

Description

Medical text understanding method and system based on BERT model
Technical Field
The invention relates to the technical field of text processing, in particular to a method and a system for medical text understanding based on a BERT model.
Background
As living standards rise, people pay increasing attention to their health, and their expectations of medical services grow accordingly. Existing medical services, constrained by resources, management, and other factors, struggle to meet this ever-increasing demand. Intelligent healthcare is therefore becoming more and more important, and making full use of the knowledge in medical texts can accelerate its progress.
At present, research on text understanding in the medical field is limited. Traditional neural-network-based named entity recognition models need a large amount of labeled training data; however, the terminology in medical-field data is highly specialized and expensive to label, so accurately labeled data are scarce and no large-scale medical-field text data set is available. At the same time, because doctors' writing habits vary widely, current entity recognition models find it difficult to classify entities from context and to recognize medical entities.
In view of this, how to acquire a large-scale medical text data set and construct a medical entity recognition model that can be applied effectively in the medical field, so that the recognized medical entity information can be used for medical text understanding, is a problem that those skilled in the art need to solve.
Disclosure of Invention
The invention provides a BERT model-based medical text understanding method: large-scale medical-field text data are generated with a text-copy-based medical text generation technique, and a medical text entity recognition model is trained on the generated data, so that entity recognition can be performed on the medical text to be processed with the trained model; the semantics of the medical text entities are then extracted with a rule-based information extraction method, and the medical text is understood according to the extracted semantic information.
In order to achieve the above object, the present invention provides a medical text understanding method based on a BERT model, including:
acquiring medical text data, and filtering invalid medical text data by using a sentence filtering model;
according to the filtered medical text data, generating large-scale medical text data by using a medical text generation model based on text copy;
training a medical text entity recognition model by using the generated large-scale medical field text data;
performing entity recognition on the medical text to be processed by using the trained medical text entity recognition model;
semantic extraction is carried out on the medical text entity by using an attention-based information extraction method to obtain semantic features of the medical text entity;
and, according to the semantic features of the medical text entities, understanding the medical text with a multilayer perceptron.
Optionally, the filtering out invalid medical text data by using a sentence filtering model includes:
the sentence filtering model is a BERT-based self-attention mechanism model; the process of filtering invalid medical text data by using the sentence filtering model comprises the following steps:
1) adding a [CLS] token before the input word sequence and a [SEP] token after it, converting the input word sequence into the corresponding Token Embeddings, and computing the Position Embedding of each word; the two embeddings of each word are summed to obtain the input embedding code;
2) the attention weight α of the input sequence vectors is derived with a global attention matrix:

α = softmax(W·T)

wherein:
W is the global attention matrix, used to help the model capture the information in the input-sequence representation that is most important for classification;
T is the matrix of BERT word vectors;
3) multiplying each attention weight by the BERT word vector produced by the word vector coding layer, and summing, to obtain the attention representation of the input sequence:

attention = Σ_i α_i·T_i

wherein:
T_i is the i-th BERT word vector;
α_i is the attention weight of the i-th BERT word vector;
4) outputting the sentence filtering result through the parameter matrix of the multilayer perceptron:

Output = sigmoid(W_0·attention)

wherein:
W_0 is the parameter matrix of the multilayer perceptron.
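The filtering steps above can be sketched in a few lines of pure Python (the function names, dimensions, and the 0.5 threshold are illustrative assumptions, not from the patent; a real implementation would obtain the word vectors T from a BERT encoder):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sentence_filter(word_vectors, w_attn, w_out, threshold=0.5):
    """alpha = softmax(W.T), attention = sum_i alpha_i * T_i,
    Output = sigmoid(W_0 . attention); sentences scoring below the
    threshold are filtered out."""
    # one attention score per word vector: W . T_i
    scores = [sum(w * t for w, t in zip(w_attn, tv)) for tv in word_vectors]
    alpha = softmax(scores)
    # attention representation: weighted sum of the word vectors
    dim = len(word_vectors[0])
    attention = [sum(alpha[i] * word_vectors[i][d] for i in range(len(word_vectors)))
                 for d in range(dim)]
    # sigmoid readout with the perceptron parameter vector W_0
    logit = sum(w * a for w, a in zip(w_out, attention))
    prob = 1.0 / (1.0 + math.exp(-logit))
    return prob >= threshold
```

The weighted-sum pooling is the only nonstandard piece; everything else is a plain logistic readout.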
Optionally, the generating of large-scale medical text data by using a medical text generation model based on text copy includes:
1) introducing a latent variable z_t to control whether, at each decoding step, the model generates a word from the vocabulary or copies the word to be generated from the text: z_t = 0 means the decoder generates a word from the vocabulary at the current step, and z_t = 1 means the decoder copies a word from the input text D at the current step;
2) generating the medical text with a decoder, where the probability of the decoder producing the t-th word is:

P(y_t | y_<t, D) = P(z_t = 0 | S)·P_gen(y_t | S) + P(z_t = 1 | S)·P_copy(y_t | S, D)

wherein:
D is the text of the sentence filtering result;
S is the text word vector;
y_t is the t-th generated word;
z_t is the latent variable that controls whether, at each decoding step, the model generates a word from the vocabulary or copies the word to be generated from the text: z_t = 0 means the decoder generates a word from the vocabulary at the current step, and z_t = 1 means the decoder copies a word from the input text D at the current step.
Optionally, the process of training the medical text entity recognition model on the generated large-scale medical-field text data includes:
1) with a bidirectional masked language model, randomly marking input tokens with [MASK] and predicting the masked tokens from their context;
2) randomly selecting two sentences from the medical text data and, if their [MASK] marks are labeled as context marks, treating one sentence as the next sentence of the other;
3) repeating the above steps until 30% of the medical text data is marked with [MASK].
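The random [MASK] marking in step 1) can be sketched as follows (a minimal illustration with the 30% masking ratio; how tokens are batched and predicted is not specified in the patent):

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_ratio=0.3, seed=0):
    """Randomly replace mask_ratio of the tokens with [MASK]; return the
    masked sequence plus a map from masked positions to the original
    tokens (these originals are the prediction targets)."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_ratio))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    targets = {}
    for p in positions:
        targets[p] = masked[p]   # remember what the model must predict
        masked[p] = MASK
    return masked, targets
```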
Optionally, performing entity recognition on the medical text with the trained medical text entity recognition model includes:
1) segmenting the large-scale medical-field text data with the jieba word segmentation tool;
2) computing the frequency of each word in the segmentation result, replacing high-frequency segmented words with shorter characters, and introducing a word boundary symbol that joins the divided word groups so that the original word order is kept unchanged;
3) checking whether the user-defined rules have been updated and, once the rules are confirmed to be the latest, extracting the specialized medical terms covered by the rules;
4) recognizing medical text entities with the BERT pre-trained semantic model;
5) combining 3) and 4) so as to retain as much semantic information as possible.
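Step 2) — replacing high-frequency segmented words with short placeholders and joining segments with the boundary symbol — can be sketched as below. This is a hypothetical illustration: real input would come from jieba, and the "#k" placeholder scheme (assumed not to collide with real words) is our own stand-in for the patent's "smaller characters":

```python
from collections import Counter
from itertools import count

def compress_segments(segments, min_freq=2):
    """Replace words occurring at least min_freq times with short
    placeholder tokens and join everything with the boundary symbol "_",
    so the original word order stays explicit and recoverable."""
    freq = Counter(segments)
    ids = count()
    placeholders = {w: f"#{next(ids)}"
                    for w, n in freq.most_common() if n >= min_freq}
    encoded = "_".join(placeholders.get(w, w) for w in segments)
    return encoded, placeholders

def restore(encoded, placeholders):
    """Invert the compression: map placeholders back, drop boundaries."""
    inverse = {v: k for k, v in placeholders.items()}
    return "".join(inverse.get(tok, tok) for tok in encoded.split("_"))
```

The round trip through `restore` demonstrates the "recover the original text without ambiguity" property claimed later in the description.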
Optionally, the semantic extraction of the medical text entity by using the attention-based information extraction method includes:
1) inputting the medical text entity into a CNN to obtain a word vector representation, and feeding that representation into a two-layer Highway Network to obtain the vector representation:

C = Highway(CNN(w_i))

wherein w_i is the i-th word in the medical text entity, w_i = {c_1, c_2, ..., c_n}, and c_k is the k-th character of w_i;
2) bidirectional context coding is carried out on the word vector representation in the medical text entity by utilizing a context coding layer:
H=BiLSTM(C)
U=BiLSTM(C)
wherein:
H and U are the two context coding results;
C is the word vector representation of the medical text entity;
3) computing the similarity matrix S of the context coding results:

S_tj = sim(H_:t, U_:j)

wherein:
H_:t is the t-th column vector of H;
U_:j is the j-th column vector of U;
sim is the cosine similarity measure;
4) calculating an attention weight vector G for the medical text entity:
G = softmax(S_:t)

wherein:
S_:t is the t-th column of the similarity matrix S;
5) outputting the semantic features of the medical text with a BiLSTM model:

M = BiLSTM(G)

wherein:
M is the semantic feature of the medical text.
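Steps 3) and 4) can be sketched as follows, treating H and U as lists of column vectors (a minimal pure-Python illustration; in the model, H and U would come from the BiLSTM context coding layers):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def row_softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_over_entities(H, U):
    """S[t][j] = sim(H_:t, U_:j) using cosine similarity; each row of S
    is then softmax-normalized to give the attention weights G."""
    S = [[cosine(h, u) for u in U] for h in H]
    G = [row_softmax(row) for row in S]
    return S, G
```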
Optionally, the understanding of the medical text by using the multi-layer perceptron comprises:
the medical text is understood with a multilayer perceptron, taking the understanding result y with the highest probability as the perceptron's output; the specific process is as follows:
P(y|M)=σ(MLP(M))
wherein:
M is the semantic feature of the medical text entity;
y is the medical text understanding result;
σ is the sigmoid function;
MLP is a perceptron consisting of two linear transformations with a nonlinear ReLU activation function;
cross entropy is used as the loss function for training the multilayer perceptron:

Loss = -(1/N) Σ_{i=1}^{N} [ y_i·log(p_i) + (1 - y_i)·log(1 - p_i) ]

wherein:
N is the total number of training samples;
p_i is the probability predicted for the i-th sample;
during training, the whole model is optimized with a stochastic gradient descent optimizer.
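A minimal sketch of the forward pass P(y|M) = σ(MLP(M)) and the cross-entropy loss (layer sizes and the parameter layout are illustrative assumptions; real training would also need the gradient updates that the SGD optimizer applies):

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mlp_predict(m, W1, b1, w2, b2):
    """Two linear layers with a ReLU in between; the final sigmoid turns
    the score into a probability P(y | M)."""
    hidden = [relu(sum(w * x for w, x in zip(row, m)) + b)
              for row, b in zip(W1, b1)]
    return sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)

def cross_entropy(y_true, y_prob, eps=1e-12):
    """Binary cross-entropy averaged over the training samples."""
    n = len(y_true)
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for y, p in zip(y_true, y_prob)) / n
```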
In addition, to achieve the above object, the present invention also provides a medical text understanding system based on a BERT model, the system comprising:
medical text generation means for generating large-scale medical text data using a medical text generation model based on a text copy;
the medical text processor is used for training a medical text entity recognition model by utilizing the generated large-scale medical field text data and carrying out entity recognition on a medical text to be processed by utilizing the trained medical text entity recognition model; meanwhile, semantic extraction is carried out on the medical text entity by using an attention-based information extraction method to obtain semantic features of the medical text entity;
and the medical text understanding device, which is used for understanding the medical text with a multilayer perceptron.
In addition, to achieve the above object, the present invention also provides a computer-readable storage medium storing medical text understanding program instructions executable by one or more processors to implement the steps of the BERT model-based medical text understanding method described above.
Compared with the prior art, the invention provides a BERT model-based medical text understanding method, which has the following advantages:
Firstly, because existing text understanding research in the medical field is limited, and the terminology of medical-field data is highly specialized and expensive to label, accurately labeled data are scarce and a large-scale medical-field text data set is lacking. The invention therefore proposes a text-copy-based medical text generation model for generating medical text data, introducing a latent variable z_t to control whether, at each decoding step, the model generates a word from the vocabulary or copies the word to be generated from the text. Usually z_t is set to 0, indicating that the decoder generates a word from the vocabulary at the current step; when a medical-field proper noun is encountered, z_t is set to 1, indicating that the decoder copies a word from the input text D at the current step. By introducing this copy mechanism, the model can, when generating large-scale medical text data, copy certain medical-field special words from the input text D directly into the generated data. This alleviates the difficulty of generating sparse words such as proper names and yields a large-scale medical text data set for subsequent medical text understanding.
Meanwhile, because traditional rule-based methods have poor robustness and portability, the invention proposes an information extraction method that fuses rules with named entity recognition semantics. First, building on traditional jieba-based word segmentation, the invention computes the frequency of each word in the segmentation result and replaces high-frequency segmented words with shorter characters; for example, if the word "doctor" has a high frequency in the jieba segmentation result, it is replaced with the character Y. This splits the segmentation result into smaller units and effectively reduces the size of the segmentation dictionary. A word boundary symbol "_" is then introduced to join the divided words while keeping the original word order unchanged, so that the algorithm can recover the original text without ambiguity. Second, for terms in the medical domain, problems such as non-standard expressions by some doctors or newly coined medical vocabulary can degrade the extraction of real information. To reduce this influence, the invention adds a rule-update operation to the information extraction method, where the rules are patch files such as regular expressions and mapping tables; simple custom rules for medical names can be added automatically, so that medical term information is extracted effectively.
Drawings
Fig. 1 is a schematic flowchart of a medical text understanding method based on a BERT model according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a medical text understanding system based on a BERT model according to an embodiment of the present invention;
the implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Large-scale medical-field text data are generated with a text-copy-based medical text generation technique, and a medical text entity recognition model is trained on the generated data, so that entity recognition can be performed on the medical text to be processed with the trained model; the semantics of the medical text entities are then extracted with a rule-based information extraction method, and the medical text is understood according to the extracted semantic information. Referring to fig. 1, a schematic diagram of a medical text understanding method based on a BERT model according to an embodiment of the present invention is provided.
In this embodiment, the medical text understanding method based on the BERT model includes:
s1, medical text data are obtained, invalid medical text data are filtered out through a sentence filtering model, and large-scale medical text data are generated through a medical text generation model based on text copying.
Firstly, acquiring a large amount of medical text data, and filtering invalid medical text data by using a sentence filtering model;
the sentence filtering model is a BERT-based self-attention mechanism model; the process of filtering invalid medical text data by using the sentence filtering model comprises the following steps:
1) adding a [CLS] token before the input word sequence and a [SEP] token after it, converting the input word sequence into the corresponding Token Embeddings, and computing the Position Embedding of each word; the two embeddings of each word are summed to obtain the input embedding code. Formally, for an input sequence S = {s_1, s_2, ..., s_n}, the BERT input sequence is constructed as {[CLS], s_1, s_2, ..., s_n, [SEP]}, and the corresponding output is denoted

T = {T_[CLS], T_1, T_2, ..., T_n, T_[SEP]}

wherein s_i is the i-th word in a piece of medical text data, T_i is the corresponding BERT word vector, and T_[CLS], T_[SEP] are the BERT word vectors of the [CLS] and [SEP] tokens;
2) the attention weight α of the input sequence vectors is derived with a global attention matrix:

α = softmax(W·T)

wherein:
W is the global attention matrix, used to help the model capture the information in the input-sequence representation that is most important for classification;
T is the matrix of BERT word vectors;
3) multiplying each attention weight by the BERT word vector produced by the word vector coding layer, and summing, to obtain the attention representation of the input sequence:

attention = Σ_i α_i·T_i

wherein:
T_i is the i-th BERT word vector;
α_i is the attention weight of the i-th BERT word vector;
4) outputting the sentence filtering result through the parameter matrix of the multilayer perceptron:

Output = sigmoid(W_0·attention)

wherein:
W_0 is the parameter matrix of the multilayer perceptron;
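The Token Embedding + Position Embedding sum of step 1) can be sketched as follows (the embedding table and the scalar position code are toy stand-ins for BERT's learned embedding tables):

```python
def embed_input(words, token_emb, dim=4):
    """Wrap the word sequence as [CLS] w_1 ... w_n [SEP], then sum a
    token embedding and a position embedding for each token. Unknown
    words fall back to a zero vector; the 0.01*pos position code is a
    toy placeholder for a learned position embedding."""
    seq = ["[CLS]"] + list(words) + ["[SEP]"]
    out = []
    for pos, word in enumerate(seq):
        tok = token_emb.get(word, [0.0] * dim)
        pos_vec = [0.01 * pos] * dim
        out.append([t + p for t, p in zip(tok, pos_vec)])
    return out
```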
Furthermore, the sentence filtering result is encoded with a bidirectional LSTM model. In the decoding and medical text generation stages, medical nouns are sparse in the corpus, and the model has difficulty learning to generate these proper nouns. In a specific embodiment of the invention, a copy mechanism is therefore introduced that allows the model, when generating certain words, to copy them directly from the input text D into the generated output, alleviating the difficulty of generating sparse words such as proper names;
in detail, the medical text generation process of the medical text generation model based on text copy comprises the following steps:
1) introducing a latent variable z_t to control whether, at each decoding step, the model generates a word from the vocabulary or copies the word to be generated from the text: z_t = 0 means the decoder generates a word from the vocabulary at the current step, and z_t = 1 means the decoder copies a word from the input text D at the current step;
2) generating the medical text with a decoder, where the probability of the decoder producing the t-th word is:

P(y_t | y_<t, D) = P(z_t = 0 | S)·P_gen(y_t | S) + P(z_t = 1 | S)·P_copy(y_t | S, D)

wherein:
D is the text of the sentence filtering result;
S is the text word vector;
y_t is the t-th generated word;
z_t is the latent variable that controls whether, at each decoding step, the model generates a word from the vocabulary or copies the word to be generated from the text.
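One decoding step of the copy mechanism — marginalizing over the latent variable z_t — can be sketched as below (the gate probability and the two word distributions are illustrative inputs; in the model they would come from the decoder state):

```python
def mix_generate_copy(p_copy, gen_dist, copy_dist):
    """One decoding step, marginalizing over z_t:
    z_t = 0 with probability 1 - p_copy: draw y_t from the vocabulary
    distribution; z_t = 1 with probability p_copy: copy y_t from the
    distribution over words of the input text D."""
    mixed = {}
    for w, p in gen_dist.items():
        mixed[w] = mixed.get(w, 0.0) + (1.0 - p_copy) * p
    for w, p in copy_dist.items():
        mixed[w] = mixed.get(w, 0.0) + p_copy * p
    return mixed
```

With a high gate probability, a rare medical term present only in the input text dominates the mixed distribution, which is exactly the sparse-word situation the copy mechanism targets.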
And S2, training a medical text entity recognition model by using the generated large-scale medical field text data, and performing entity recognition on the medical text to be processed by using the trained medical text entity recognition model.
Further, the invention trains the medical text entity recognition model on the generated large-scale medical-field text data; the training is divided into the following steps:
1) with a bidirectional masked language model, randomly marking input tokens with [MASK] and predicting the masked tokens from their context;
2) randomly selecting two sentences from the medical text data and, if their [MASK] marks are labeled as context marks, treating one sentence as the next sentence of the other;
3) repeating the above steps until 30% of the medical text data is marked with [MASK].
The process of performing entity recognition on the medical text with the medical text entity recognition model comprises the following steps:
1) segmenting the large-scale medical-field text data with the jieba word segmentation tool;
2) computing the frequency of each word in the segmentation result, replacing high-frequency segmented words with shorter characters, and introducing a word boundary symbol that joins the divided word groups so that the original word order is kept unchanged;
3) checking whether the user-defined rules have been updated and, once the rules are confirmed to be the latest, extracting the specialized medical terms covered by the rules;
4) recognizing medical text entities with the BERT pre-trained semantic model;
5) combining 3) and 4) so as to retain as much semantic information as possible.
And S3, performing semantic extraction on the medical text entity by using an attention-based information extraction method to obtain semantic features of the medical text entity.
Furthermore, the invention utilizes an attention-based information extraction method to perform semantic extraction on the medical text entity to obtain the semantic features of the medical text entity, and the extraction process of the semantic features of the medical text entity comprises the following steps:
1) inputting the medical text entity into a CNN to obtain a word vector representation, and feeding that representation into a two-layer Highway Network to obtain the vector representation:

C = Highway(CNN(w_i))

wherein w_i is the i-th word in the medical text entity, w_i = {c_1, c_2, ..., c_n}, and c_k is the k-th character of w_i;
2) bidirectional context coding is carried out on the word vector representation in the medical text entity by utilizing a context coding layer:
H=BiLSTM(C)
U=BiLSTM(C)
wherein:
H and U are the two context coding results;
C is the word vector representation of the medical text entity;
3) computing the similarity matrix S of the context coding results:

S_tj = sim(H_:t, U_:j)

wherein:
H_:t is the t-th column vector of H;
U_:j is the j-th column vector of U;
sim is the cosine similarity measure;
4) calculating an attention weight vector G for the medical text entity:
G = softmax(S_:t)

wherein:
S_:t is the t-th column of the similarity matrix S;
5) outputting the semantic features of the medical text with a BiLSTM model:

M = BiLSTM(G)

wherein:
M is the semantic feature of the medical text.
And S4, understanding the medical text by utilizing a multilayer perceptron according to the semantic features of the medical text entity.
Further, according to the semantic feature M of the medical text entity, the invention understands the medical text with a multilayer perceptron, taking the understanding result y with the highest probability as the perceptron's output; the specific process is as follows:
P(y|M)=σ(MLP(M))
wherein:
M is the semantic feature of the medical text entity;
y is the medical text understanding result;
σ is the sigmoid function;
MLP is a perceptron consisting of two linear transformations with a nonlinear ReLU activation function;
Further, the invention uses cross entropy as the loss function of the multilayer perceptron:

Loss = -(1/N) Σ_{i=1}^{N} [ y_i·log(p_i) + (1 - y_i)·log(1 - p_i) ]

wherein:
N is the total number of training samples;
p_i is the probability predicted for the i-th sample;
during training, the whole model is optimized with a stochastic gradient descent optimizer. The initial learning rate is 0.005 and is gradually halved as training proceeds, to ensure a good training effect.
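The halving schedule can be sketched as follows (the halving interval is an assumption; the text only states that the 0.005 initial rate is gradually halved as training proceeds):

```python
def halved_lr(epoch, initial_lr=0.005, halve_every=10):
    """Learning rate that starts at initial_lr and is halved every
    halve_every epochs; the interval of 10 is an illustrative choice."""
    return initial_lr * (0.5 ** (epoch // halve_every))
```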
The following describes embodiments of the invention through an algorithmic experiment and tests of the proposed processing method. The hardware test environment of the algorithm is: Ubuntu 16.04, with the open-source framework TensorFlow 1.6, an Intel i7-7700K processor, and an Nvidia GTX 1080-Ti graphics card. The comparison models are the BiLSTM, BERT, and CRF-LSTM models.
In the algorithm experiment, the data set is cMedQA2, a large-scale Chinese medical text data set. In the experiment, the medical text data in the data set are input into the algorithm models, and the accuracy of medical text understanding is used as the evaluation index of algorithm performance.
According to the experimental results, the medical text understanding accuracy of the BiLSTM model is 95.62%, that of the BERT model is 92.14%, that of the CRF-LSTM model is 93.18%, and that of the BERT model-based medical text understanding algorithm of the invention is 96.82%.
The invention also provides a BERT model-based medical text understanding system. Referring to fig. 2, a schematic diagram of an internal structure of a BERT model-based medical text understanding system according to an embodiment of the present invention is provided.
In the present embodiment, the BERT model-based medical text understanding system 1 includes at least a medical text generating means 11, a medical text processor 12, a medical text understanding means 13, a communication bus 14, and a network interface 15.
The medical text generation device 11 may be a PC (Personal Computer), a terminal device such as a smartphone, a tablet Computer, or a mobile Computer, or may be a server.
The medical text processor 12 includes at least one type of readable storage medium, including flash memory, hard disks, multimedia cards, card-type memory (e.g., SD or DX memory), magnetic memory, magnetic disks, optical disks, and the like. In some embodiments, the medical text processor 12 may be an internal storage unit of the BERT model-based medical text understanding system 1, for example a hard disk of the system 1. In other embodiments, the medical text processor 12 may be an external storage device of the system 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the system 1. Further, the medical text processor 12 may include both an internal storage unit and an external storage device of the system 1. The medical text processor 12 can be used not only to store the application software installed in the system 1 and various kinds of data, but also to temporarily store data that has been output or is to be output.
Medical text understanding apparatus 13 may be, in some embodiments, a Central Processing Unit (CPU), controller, microcontroller, microprocessor or other data processing chip for executing program code stored in medical text processor 12 or processing data, such as medical text understanding program instructions.
The communication bus 14 is used to enable connection communication between these components.
The network interface 15 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface), and is typically used to establish a communication connection between the system 1 and other electronic devices.
Optionally, the system 1 may further comprise a user interface, which may include a display and an input unit such as a keyboard, and may optionally also include a standard wired interface and a wireless interface. In some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is used for displaying the information processed in the BERT model-based medical text understanding system 1 and for displaying a visualized user interface.
Fig. 2 shows only the BERT model-based medical text understanding system 1 with components 11-15. Those skilled in the art will understand that the structure shown in fig. 2 does not limit the system 1, which may include fewer or more components than shown, combine certain components, or arrange the components differently.
In the embodiment of the system 1 shown in fig. 2, medical text understanding program instructions are stored in the medical text processor 12; the steps performed by the medical text understanding apparatus 13 when executing the medical text understanding program instructions stored in the medical text processor 12 are the same as those of the BERT model-based medical text understanding method described above, and are not repeated here.
Furthermore, an embodiment of the present invention also provides a computer-readable storage medium having stored thereon BERT model-based medical text understanding program instructions, which are executable by one or more processors to implement the following operations:
acquiring medical text data, and filtering invalid medical text data by using a sentence filtering model;
according to the filtered medical text data, generating large-scale medical text data by using a medical text generation model based on text copy;
training a medical text entity recognition model by using the generated large-scale medical field text data;
performing entity recognition on the medical text to be processed by using the trained medical text entity recognition model;
semantic extraction is carried out on the medical text entity by using an attention-based information extraction method to obtain semantic features of the medical text entity;
and according to the semantic features of the medical text entities, understanding the medical text by using a multilayer perceptron.
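As a rough illustration only, the six operations above can be sketched as a pipeline of stubs; every function body here is a hypothetical placeholder standing in for the corresponding model in the patent, not an implementation of it.

```python
# Hedged pipeline sketch of the six operations; all bodies are toy
# placeholders for the patent's models (filter, generator, recognizer,
# extractor, perceptron).

def filter_sentences(texts):        # stands in for the sentence filtering model
    return [t for t in texts if t.strip()]

def generate_corpus(texts):         # stands in for the text-copy generation model
    return texts * 2                # placeholder for large-scale generation

def recognize_entities(text):       # stands in for the entity recognition model
    return [w for w in text.split() if w.istitle()]

def extract_semantics(entities):    # stands in for attention-based extraction
    return {e: len(e) for e in entities}

def understand(features):           # stands in for the multilayer perceptron
    return max(features, key=features.get) if features else None

texts = filter_sentences(["Patient has Fever", "", "Aspirin prescribed"])
corpus = generate_corpus(texts)
entities = recognize_entities(corpus[0])
print(understand(extract_semantics(entities)))  # Patient
```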
It should be noted that the above-mentioned numbering of the embodiments of the present invention is merely for description and does not represent the merits of the embodiments. The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element introduced by the phrase "comprising a" does not exclude the presence of other like elements in the process, apparatus, article, or method that includes the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (9)

1. A BERT model-based medical text understanding method, the method comprising:
acquiring medical text data, and filtering invalid medical text data by using a sentence filtering model;
according to the filtered medical text data, generating large-scale medical text data by using a medical text generation model based on text copy;
training a medical text entity recognition model by using the generated large-scale medical field text data;
performing entity recognition on the medical text to be processed by using the trained medical text entity recognition model;
semantic extraction is carried out on the medical text entity by using an attention-based information extraction method to obtain semantic features of the medical text entity;
and according to the semantic features of the medical text entities, understanding the medical text by using a multilayer perceptron.
2. The BERT model-based medical text understanding method of claim 1, wherein the filtering out invalid medical text data using the sentence filtering model comprises:
the sentence filtering model is a BERT-based self-attention mechanism model; the process of filtering invalid medical text data by using the sentence filtering model comprises the following steps:
1) a [CLS] mark is added before the input word sequence and a [SEP] mark is added after it; the input word sequence is converted into the corresponding Token Embedding, and the Position Embedding corresponding to each word is computed; the two Embeddings corresponding to each word are added to obtain the input Embedding code;
2) the attention weight α of the input sequence vector is obtained using a global attention matrix:

α = softmax(WT)

wherein:

W is the global attention matrix, used to help the model capture the information in the input-sequence representation that is most important for classification;

T is the matrix of BERT word vectors;
3) the attention weights are multiplied by the BERT word vector representations obtained from the word vector coding layer to obtain the attention representation of the input sequence:

attention = Σ_i α_i T_i

wherein:

T_i is the i-th BERT word vector;

α_i is the attention weight of the i-th BERT word vector;
4) the sentence filtering result is output through the parameter matrix of the multilayer perceptron:

Output = sigmoid(W_0 · attention)

wherein:

W_0 is the parameter matrix of the multilayer perceptron.
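As an illustration of steps 2)-4), the attention-based filtering arithmetic can be sketched numerically. The shapes of W and W_0 and all values below are assumptions, since the patent does not fix any dimensions.

```python
import numpy as np

# Numeric sketch of the sentence-filtering steps 2)-4): attention weights
# over BERT word vectors, their weighted sum, and a sigmoid output.
# W, W0, and the toy vectors are illustrative assumptions.

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T = rng.normal(size=(5, 8))   # 5 toy "BERT word vectors" of dimension 8
W = rng.normal(size=(8,))     # global attention parameters (shape assumed)
W0 = rng.normal(size=(8,))    # perceptron output parameters (shape assumed)

alpha = softmax(T @ W)                            # step 2): α = softmax(WT)
attention = (alpha[:, None] * T).sum(axis=0)      # step 3): Σ_i α_i T_i
output = 1.0 / (1.0 + np.exp(-(W0 @ attention)))  # step 4): sigmoid(W0·attention)

# A sentence whose score falls below a chosen threshold would be filtered out.
print(0.0 < float(output) < 1.0)  # True
```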
3. The BERT model-based medical text understanding method of claim 2, wherein the large-scale medical text data generation using a text copy-based medical text generation model comprises:
1) an implicit variable z_t is introduced to control whether, at each decoding step, the model generates a word from the word list or copies the currently needed word from the text: z_t = 0 indicates that the decoder generates a word from the word list at the current time step, and z_t = 1 indicates that the decoder copies a word from the input text D at the current time step;
2) the medical text is generated with a decoder, where the probability of generating the t-th word, marginalized over z_t, is:

P(y_t | D, S) = P(y_t, z_t = 0 | D, S) + P(y_t, z_t = 1 | D, S)

wherein:

D is the text of the sentence filtering result;

S is the text word vector;

y_t is the t-th generated word;

z_t is the implicit variable: z_t = 0 indicates that the decoder generates a word from the word list at the current time step, and z_t = 1 indicates that the decoder copies a word from the input text D at the current time step.
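The generate-or-copy mixture of claim 3 can be illustrated with hand-picked toy probabilities; p_gen, p_copy, and the gate value below are illustrative stand-ins, not outputs of the patent's decoder.

```python
import numpy as np

# Toy sketch of the copy/generate mixture: at each step the decoder
# either generates from the vocabulary (z_t = 0) or copies from the input
# text D (z_t = 1); the output distribution marginalizes over z_t.
# All probabilities are hand-picked for illustration.

vocab = ["fever", "cough", "aspirin", "<unk>"]
p_gen = np.array([0.5, 0.3, 0.1, 0.1])   # P(y_t | z_t = 0): vocabulary softmax
p_copy = np.array([0.0, 0.8, 0.2, 0.0])  # P(y_t | z_t = 1): copy attention over D
p_z1 = 0.6                               # P(z_t = 1): predicted copy gate

# P(y_t) = P(z_t=0)·P(y_t|z_t=0) + P(z_t=1)·P(y_t|z_t=1)
p_word = (1 - p_z1) * p_gen + p_z1 * p_copy
print(vocab[int(p_word.argmax())])  # cough
```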
4. The method for understanding medical texts based on the BERT model as claimed in claim 3, wherein the training process of the medical text entity recognition model by using the generated large-scale medical field text data comprises:
1) using a bidirectional masked language model, input tokens are marked with [MASK] by a random marking method, and each [MASK]-marked token is predicted from its context;
2) two sentences are randomly selected from the medical text data, and if their [MASK] marks are marked as context marks, one sentence is considered to be the next sentence of the other;
3) the above steps are repeated until 30% of the medical text data is marked with [MASK].
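A minimal sketch of the random [MASK] marking, assuming simple whole-token masking of 30% of the tokens (the exact sampling procedure is not specified in the text):

```python
import random

# Hedged sketch of the random-[MASK] procedure of claim 4: tokens are
# masked at random until roughly 30% of them carry a [MASK] mark.
# Tokenization and the stopping rule are simplified assumptions.

def mask_tokens(tokens, ratio=0.3, seed=42):
    rng = random.Random(seed)
    n_mask = int(len(tokens) * ratio)
    positions = rng.sample(range(len(tokens)), n_mask)  # distinct positions
    masked = list(tokens)
    for p in positions:
        masked[p] = "[MASK]"
    return masked, sorted(positions)

tokens = ["patient", "reports", "persistent", "dry", "cough",
          "and", "mild", "fever", "since", "Tuesday"]
masked, where = mask_tokens(tokens)
print(masked.count("[MASK]"))  # 3 of 10 tokens (~30%) are masked
```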
5. The method of claim 4, wherein performing entity recognition on the medical text to be processed using the trained medical text entity recognition model comprises:
1) performing word segmentation processing on the text data in the large-scale medical field by adopting a jieba word segmentation tool;
2) the word frequency of each word in the segmentation result is calculated; segmented words with higher word frequency are replaced with shorter symbols, and a word boundary symbol is introduced and combined with the divided word groups so that the original word order remains unchanged;
3) whether the user-defined rules have been updated is checked, and, with the rules confirmed to be the latest, special medical terms are extracted by rule;
4) recognizing the medical text entity by adopting a BERT pre-training semantic model;
5) the results of steps 3) and 4) are combined by taking their union to retain more semantics.
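Steps 1)-2) can be illustrated as follows. The patent uses the jieba tool for segmentation; to keep the sketch self-contained the tokens below are already segmented, and the boundary symbol "▁" and the placeholder ids are assumptions.

```python
from collections import Counter

# Illustrative sketch of steps 1)-2): count word frequencies in the
# segmentation result and replace high-frequency words with shorter
# symbols, joining with a boundary symbol so word order is preserved.
# Pre-segmented tokens stand in for jieba output; "▁" and "#i" are assumed.

segmented = ["糖尿病", "患者", "血糖", "患者", "血糖", "偏高", "患者"]

freq = Counter(segmented)
ids = {w: f"#{i}" for i, (w, _) in enumerate(freq.most_common()) if freq[w] > 1}
encoded = "▁".join(ids.get(w, w) for w in segmented)
print(encoded)  # 糖尿病▁#0▁#1▁#0▁#1▁偏高▁#0
```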
6. The BERT model-based medical text understanding method of claim 5, wherein the semantic extraction of the medical text entity using the attention-based information extraction method comprises:
1) the medical text entity is input into a CNN to obtain a word vector representation, which is then passed through a two-layer Highway Network to obtain the vector representation:

C = Highway(CNN(w_i))

wherein w_i is the i-th word in the medical text entity, w_i = {c_1, c_2, ..., c_n}, where c_k is the k-th character of the word w_i;
2) bidirectional context coding is carried out on the word vector representation in the medical text entity by utilizing a context coding layer:
H=BiLSTM(C)
U=BiLSTM(C)
wherein:

H, U are the two context encoding results, respectively;

C is the word vector representation of the medical text entity;
3) the similarity matrix S of the context encoding results is computed:

S = sim(H_:t, U_:j)

wherein:

H_:t is the t-th column vector of H;

U_:j is the j-th column vector of U;

sim is the cosine similarity measure;
4) the attention weight vector G for the medical text entity is computed:

G = softmax(S_:t)

wherein:

S_:t is the t-th column of the similarity matrix S;
5) the semantic features of the medical text are output using a BiLSTM model:

M = BiLSTM(G)

wherein:

M is the semantic feature of the medical text.
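Steps 3)-4) rest on the cosine-similarity matrix S and the softmax attention G. A numeric sketch follows, with toy H and U standing in for the BiLSTM outputs and rows standing in for the columns H_:t, U_:j:

```python
import numpy as np

# Numeric sketch of steps 3)-4) of claim 6: cosine-similarity matrix S
# between two context encodings H and U, then a softmax over one column
# of S to get the attention vector G. H and U are toy matrices.

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
H = rng.normal(size=(4, 6))   # 4 encoded positions, dimension 6 (assumed)
U = rng.normal(size=(4, 6))

S = np.array([[cosine(H[t], U[j]) for j in range(4)] for t in range(4)])

t = 0
e = np.exp(S[t] - S[t].max())
G = e / e.sum()               # G = softmax(S_:t): attention weights
print(round(float(G.sum()), 6))  # 1.0
```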
7. The method as claimed in claim 6, wherein the medical text understanding using the multilayer perceptron comprises:

medical text understanding is performed with a multilayer perceptron, and the understanding result y with the highest probability is taken as the output of the multilayer perceptron; the specific process is as follows:
P(y|M)=σ(MLP(M))
wherein:

M is the semantic feature of the medical text entity;

y is the medical text understanding result;

σ is the sigmoid function;

MLP is a perceptron consisting of two linear transformations and a nonlinear ReLU activation function;
the cross entropy is used as the loss function to train the multilayer perceptron:

L = -(1/N) Σ_{i=1..N} [ y_i·log ŷ_i + (1 − y_i)·log(1 − ŷ_i) ]

wherein:

N is the total number of training samples; y_i is the label of the i-th sample, and ŷ_i = P(y_i|M) is the predicted probability;
during the training process, the entire model is optimized with a stochastic gradient descent optimizer.
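The perceptron of claim 7 (two linear maps with a ReLU between them, a sigmoid output P(y|M), and a cross-entropy loss averaged over N samples) can be sketched with scalar weights; the weight values below are illustrative, not trained.

```python
import math

# Minimal sketch of claim 7: a two-layer perceptron with ReLU, a sigmoid
# output, and the averaged cross-entropy loss. Scalar weights w1, w2,
# b1, b2 are illustrative assumptions.

def mlp(m, w1=0.8, w2=1.5, b1=0.1, b2=-0.2):
    h = max(0.0, w1 * m + b1)                       # linear + ReLU
    return 1.0 / (1.0 + math.exp(-(w2 * h + b2)))   # linear + sigmoid

def cross_entropy(labels, probs):
    n = len(labels)  # N: total number of training samples
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(labels, probs)) / n

features = [0.2, 1.4, -0.5]           # toy semantic features M
probs = [mlp(m) for m in features]    # P(y|M) = σ(MLP(M))
loss = cross_entropy([0, 1, 0], probs)
print(loss > 0.0)  # True
```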
8. A BERT model-based medical text understanding system, the system comprising:
medical text generation means for generating large-scale medical text data using a medical text generation model based on a text copy;
the medical text processor is used for training a medical text entity recognition model by utilizing the generated large-scale medical field text data and carrying out entity recognition on a medical text to be processed by utilizing the trained medical text entity recognition model; meanwhile, semantic extraction is carried out on the medical text entity by using an attention-based information extraction method to obtain semantic features of the medical text entity;
and the medical text understanding device is used for understanding the medical text by utilizing the multilayer perceptron.
9. A computer-readable storage medium having stored thereon medical text understanding program instructions executable by one or more processors to implement the steps of the BERT model-based medical text understanding method of any one of claims 1 to 7.
CN202010977191.2A 2020-09-17 2020-09-17 Medical text understanding method and system based on BERT model Withdrawn CN112016314A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010977191.2A CN112016314A (en) 2020-09-17 2020-09-17 Medical text understanding method and system based on BERT model


Publications (1)

Publication Number Publication Date
CN112016314A true CN112016314A (en) 2020-12-01

Family

ID=73522427


Country Status (1)

Country Link
CN (1) CN112016314A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112667808A (en) * 2020-12-23 2021-04-16 沈阳新松机器人自动化股份有限公司 BERT model-based relationship extraction method and system
CN112686044A (en) * 2021-01-18 2021-04-20 华东理工大学 Medical entity zero sample classification method based on language model
CN112712804A (en) * 2020-12-23 2021-04-27 哈尔滨工业大学(威海) Speech recognition method, system, medium, computer device, terminal and application
CN112784018A (en) * 2021-01-28 2021-05-11 新华智云科技有限公司 Text similarity entity disambiguation method and system for character entity library
CN113033210A (en) * 2021-05-31 2021-06-25 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Drug potential side effect mining method based on social media data analysis
CN114417856A (en) * 2021-12-29 2022-04-29 北京百度网讯科技有限公司 Text sparse coding method and device and electronic equipment
CN116364055A (en) * 2023-05-31 2023-06-30 中国科学院自动化研究所 Speech generation method, device, equipment and medium based on pre-training language model



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20201201