CN112861519A - Medical text error correction method, device and storage medium - Google Patents

Medical text error correction method, device and storage medium

Info

Publication number
CN112861519A
CN112861519A (application CN202110264865.9A)
Authority
CN
China
Prior art keywords
medical text
corrected
medical
text
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110264865.9A
Other languages
Chinese (zh)
Inventor
王亦宁
刘升平
梁家恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd and Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority to CN202110264865.9A
Publication of CN112861519A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/232 Orthographic correction, e.g. spell checking or vowelisation

Abstract

The invention relates to a medical text error correction method, a device, and a storage medium, wherein the medical text error correction method includes: establishing a pre-trained language model BERT_bio in the medical field; obtaining a medical text to be corrected through the pre-trained language model BERT_bio; and correcting the medical text to be corrected. The medical text error correction method handles wrong words, missing words, and extra words well, does not require manual labeling of a confusion-set dictionary in the medical field, reduces labor cost, and improves the coverage and applicability of medical text error correction.

Description

Medical text error correction method, device and storage medium
Technical Field
The invention relates to the field of computers, in particular to a medical text error correction method, a medical text error correction device and a storage medium.
Background
Compared with natural language text in the general field, medical text includes more professional terms and transliterated words, such as "compound ketoconazole ointment". When a missing word, an extra word, or a wrong word occurs, a text recognition system has difficulty understanding the user's intention, or misunderstands it, and consequently has difficulty returning the expected result to the user.
Therefore, to correctly understand the user's intention, medical text usually needs to be corrected. Current medical text correction models first segment the medical text into words, then look the terms up in a confusion-set dictionary to construct a set of candidate medical texts, infer the corrected text according to probability, and finally apply the correction.
However, existing word segmentation models have low accuracy, and collecting and constructing a large-scale confusion-set dictionary requires professionals with experience, which is time-consuming and labor-intensive.
Disclosure of Invention
The invention relates to a medical text error correction method, a medical text error correction device, and a storage medium, which solve the technical problem of the poor word segmentation accuracy of error correction models in the medical field.
The technical scheme for solving the technical problems is as follows:
In a first aspect, an embodiment of the present application provides a medical text correction method, where the medical text correction method includes:
establishing a pre-trained language model BERT_bio in the medical field;
obtaining a medical text to be corrected through the pre-trained language model BERT_bio;
and correcting the medical text to be corrected.
Optionally, the establishing of the pre-trained language model BERT_bio in the medical field includes:
acquiring a first medical text;
identifying and acquiring unlabeled data R_n in the first medical text, and taking the unlabeled data R_n as a second medical text,
R_n = [s_1, s_2 ... s_i ... s_n] (1)
wherein s = [w_0, w_1 ... w_i ... w_n], s represents each sentence of the second medical text, and w represents each word/character of the second medical text;
training the pre-trained language model BERT_bio with the second medical text, the training target of the pre-trained language model BERT_bio being P,
P = (w_i | w_0 ... w_(i-1), w_(i+1) ... w_n) (2)
wherein 0 ≤ i ≤ n, and n is a natural number.
Optionally, the obtaining of the medical text to be corrected through the pre-trained language model BERT_bio includes:
establishing a classification model;
predicting a probability distribution Prob of the second medical text by the classification model;
and screening the medical text to be corrected according to the probability distribution Prob.
Optionally, the establishing a classification model includes:
defining a first input sequence X_n and adding the tag [CLS] at the source end of the first input sequence X_n,
X_n = [x_0, x_1 ... x_i ... x_n] (3)
inputting the tagged first input sequence X_n into the pre-trained language model BERT_bio to obtain a first input vector E,
E = [e_0, e_1, e_2 ... e_i ... e_n] (4)
wherein e_i represents the first input vector of the i-th word/character of the second medical text;
encoding each word/character in the second medical text with the Transformer encoder Trm,
h_i^n = Trm(h_i^(n-1)), with h_i^0 = e_i (5)
wherein h_i^n represents the hidden layer vector of the i-th word/character of the n-th layer in the second medical text, 0 ≤ i ≤ n, and n is a natural number.
Optionally, the predicting, by the classification model, the probability distribution Prob of the second medical text includes:
obtaining the first hidden layer vector h_0^n of the n-th layer in the second medical text,
h_0^n = Trm(h_0^(n-1)) (6)
performing a linear transformation C on the hidden layer vector h_0^n of the first word/character of the n-th layer,
C = W·h_0^n + b (7)
and predicting the probability distribution Prob of the second medical text,
Prob = softmax(C) (8)
wherein h_0^n represents the hidden layer vector of the first character (the [CLS] tag) of the n-th layer in the second medical text.
Optionally, the correcting the medical text to be corrected includes:
self for encoding the medical text to be correctedenc
Figure BDA0002972001690000037
Wherein the content of the first and second substances,
Figure BDA0002972001690000038
representing a code SelfencThe hidden layer of the ith character/character of the nth layer of the medical text to be correctedVector, viRepresenting a code SelfencInputting an ith word/character input vector of the medical text to be corrected;
self for decoding coded medical text to be correcteddec
Figure BDA0002972001690000039
Wherein the content of the first and second substances,
Figure BDA00029720016900000310
representing decoding SelfdecThe hidden layer vector u of the ith word/character of the nth layer of the medical text to be correctediRepresenting a code SelfencThe input vector of the ith word/character of the medical text to be corrected, hNRepresenting a code SelfencThe hidden state of the nth layer of the medical text to be corrected;
and predicting the probability distribution of the text to be corrected to obtain the corrected medical text.
Optionally, the encoding of the medical text to be corrected with the encoder Self_enc includes:
defining a second input sequence L_n,
L_n = [l_0, l_1 ... l_i ... l_n] (11)
and inputting the second input sequence L_n into the pre-trained language model BERT_bio to obtain a second input vector V,
V = [v_0, v_1, v_2 ... v_i ... v_n] (12)
wherein v_i represents the input vector of the i-th word/character of the medical text to be corrected in the encoder Self_enc.
Optionally, the decoding of the encoded medical text to be corrected with the decoder Self_dec includes:
defining a third input sequence Y_n,
Y_n = [y_0, y_1 ... y_i ... y_n] (13)
and inputting the third input sequence Y_n into the pre-trained language model BERT_bio to obtain a third input vector U,
U = [u_0, u_1, u_2 ... u_i ... u_n] (14)
wherein u_i represents the input vector of the i-th word/character of the medical text to be corrected in the decoder Self_dec.
Optionally, the predicting of the probability distribution of the text to be corrected to obtain the corrected medical text includes:
obtaining the hidden state f^N of the N-th layer of the medical text to be corrected in the decoder Self_dec;
performing a linear transformation o_i on the hidden state f^N of the N-th layer,
o_i = W·f_i^N + b (15)
wherein f^N represents the hidden layer vectors of all words/characters of the N-th layer of the medical text to be corrected in the decoder Self_dec, and o_i represents the linear transformation of the i-th word/character of the N-th layer;
predicting the probability distribution prob_i of each word/character of the medical text to be corrected,
prob_i = softmax(W_i·o_i + b_i) (16)
wherein W_i and b_i are parameters of the probability distribution;
selecting the word/character z_i with the maximum probability at each position of the medical text to be corrected,
z_i = argmax(prob_i) (17)
and acquiring the corrected medical text Z according to the maximum-probability word/character at each position,
Z = [z_1, z_2 ... z_i ... z_n] (18)
wherein 0 ≤ i ≤ n, and n is a natural number.
In a second aspect, an embodiment of the present application provides a medical text correction device, including:
a training unit for establishing a pre-trained language model BERT_bio in the medical field;
a processing unit for obtaining the medical text to be corrected through the pre-trained language model BERT_bio;
and the correcting unit is used for correcting the medical text to be corrected.
In a third aspect, an embodiment of the present application provides a medical text correction apparatus comprising a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, the programs including instructions for performing the steps of the medical text correction method of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium storing a computer program to be executed by a processor to implement the medical text correction method of the first aspect.
Any one of the above embodiments of the invention has the following advantages or beneficial effects:
In the embodiments of the invention, a pre-trained language model BERT_bio in the medical field is established. The pre-trained language model BERT_bio relies on relatively easily accessible large-scale medical text (the initial corpus) and is fine-tuned using a large amount of external medical data, making it more accurate than a language model that uses only the initial corpus. Thus, obtaining the medical text to be corrected through the pre-trained language model BERT_bio and then correcting it improves the coverage and applicability of medical text correction. In addition, because the method takes the character as the minimum processing unit, its granularity is finer, and the problem that existing word segmentation models are generally of low quality is avoided.
Drawings
Fig. 1 is a flowchart of a medical text error correction method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a medical text error correction method according to an embodiment of the present invention;
fig. 3 is another schematic diagram of a medical text error correction method according to an embodiment of the present invention;
fig. 4 is another flowchart of a medical text error correction method according to an embodiment of the present invention;
fig. 5 is another flowchart of a medical text error correction method according to an embodiment of the present invention;
fig. 6 is another flowchart of a medical text correction method according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings. The following examples are provided only for explaining the method features, flow steps or principle features of the present invention, and are not intended to limit the scope of the present invention.
It should be noted that, according to the technical solution provided in the embodiment of the present application, an execution subject of each step may be a computer device, and the computer device may be a terminal device such as a smart phone, a tablet computer, and a personal computer, or may be a server. The server may be one server, a server cluster formed by a plurality of servers, or a cloud computing service center, and the present invention is not limited to this.
Fig. 1 is a flowchart of a medical text error correction method provided by an embodiment of the present invention; as shown in fig. 1, the medical text error correction method includes:
S10, establishing a pre-trained language model BERT_bio in the medical field.
In this step, the initial corpus is medical text that is easy to obtain, and the model is fine-tuned on this medical text to obtain the pre-trained language model BERT_bio.
S20, obtaining the medical text to be corrected through the pre-trained language model BERT_bio.
And S30, correcting the medical text to be corrected.
To help those skilled in the art better understand the invention, the principle of the medical text error correction method is briefly described with reference to fig. 2 and fig. 3, where fig. 2 is a schematic diagram of the medical text error correction method provided by the embodiment of the present invention, and fig. 3 is another schematic diagram of the same method. In the first stage of the application, the pre-trained language model BERT_bio is obtained by fine-tuning on a large amount of external medical data. The second stage can be understood as establishing a discriminant model: as shown in fig. 2, the second medical text screened in the first stage is examined, and the medical text to be corrected is identified. The third stage can be understood as establishing a correction model: as shown in fig. 3, a sequence-to-sequence method based on a pointer network corrects the text identified as erroneous in the second stage (i.e., the medical text to be corrected). Through the identification of the second stage and the correction of the third stage, this embodiment can accurately judge errors in medical text and correct them precisely.
In the embodiment of the invention, a pre-trained language model BERT_bio in the medical field is established. The pre-trained language model BERT_bio relies on relatively easily accessible large-scale medical text (the initial corpus) and is fine-tuned using a large amount of external medical data, making it more accurate than a language model that uses only the initial corpus. Thus, obtaining the medical text to be corrected through the pre-trained language model BERT_bio and then correcting it improves the coverage and applicability of medical text correction. Moreover, because the method takes the character as the minimum processing unit, its granularity is finer; compared with the prior art, it does not need word segmentation and avoids the problem that the quality of current word segmentation models is generally low.
The following explains the above steps in detail:
Exemplarily, as shown in fig. 4, which is another flowchart of the medical text error correction method provided by an embodiment of the present invention, the establishing of the pre-trained language model BERT_bio in the medical field includes:
S101, acquiring a first medical text;
S102, identifying and acquiring unlabeled data R_n in the first medical text, and taking the unlabeled data R_n as a second medical text,
R_n = [s_1, s_2 ... s_i ... s_n] (1)
wherein s = [w_0, w_1 ... w_i ... w_n], s represents each sentence of the second medical text, and w represents each word/character of the second medical text;
S103, training the pre-trained language model BERT_bio with the second medical text, the training target of the pre-trained language model BERT_bio being P,
P = (w_i | w_0 ... w_(i-1), w_(i+1) ... w_n) (2)
wherein 0 ≤ i ≤ n, and n is a natural number.
In this embodiment, the acquired first medical text relies on a large amount of medical data, which is easy to obtain. The unlabeled data identified in the first medical text is used as the second medical text, which facilitates its subsequent examination. In other words, the identification of the second medical text in the second stage is a second screening of the unlabeled data, so the medical text to be corrected obtained after two rounds of screening and identification is more accurate.
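By way of illustration only, the first-stage fine-tuning can be sketched in Python with the HuggingFace transformers library. This is a minimal sketch rather than the patent's implementation: the bert-base-chinese checkpoint, the corpus file name medical_corpus.txt, the output path bert_bio_ckpt, the masking rate, and all hyperparameters are illustrative assumptions.

from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")

# R_n of formula (1): one unlabeled medical sentence per line
# (medical_corpus.txt is a hypothetical file name).
corpus = load_dataset("text", data_files={"train": "medical_corpus.txt"})

def tokenize(batch):
    # Chinese BERT tokenizes per character, matching the patent's use of
    # the character as the minimum processing unit.
    return tokenizer(batch["text"], truncation=True, max_length=128)

train_set = corpus["train"].map(tokenize, batched=True,
                                remove_columns=["text"])

# Mask a fraction of characters at random; the model learns to recover w_i
# from its left and right context, i.e. the training target P of formula (2).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert_bio_ckpt",
                           num_train_epochs=1,
                           per_device_train_batch_size=32),
    data_collator=collator,
    train_dataset=train_set,
)
trainer.train()
model.save_pretrained("bert_bio_ckpt")  # plays the role of BERT_bio below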
Illustratively, as shown in fig. 5, which is another flowchart of the medical text error correction method provided by the embodiment of the present invention, the obtaining of the medical text to be corrected through the pre-trained language model BERT_bio includes:
s201, establishing a classification model.
S202, predicting the probability distribution Prob of the second medical text through the classification model.
S203, screening the medical text to be corrected according to the probability distribution Prob.
It should be noted that this embodiment can be understood as the second stage described above in this application.
For example, the classification model established in step S201 is further explained:
the establishing of the classification model comprises the following steps:
defining a first input sequence X_n and adding the tag [CLS] at the source end of the first input sequence X_n,
X_n = [x_0, x_1 ... x_i ... x_n] (3)
inputting the tagged first input sequence X_n into the pre-trained language model BERT_bio to obtain a first input vector E,
E = [e_0, e_1, e_2 ... e_i ... e_n] (4)
wherein e_i represents the first input vector of the i-th word/character of the second medical text;
encoding each word/character in the second medical text with the Transformer encoder Trm,
h_i^n = Trm(h_i^(n-1)), with h_i^0 = e_i (5)
wherein h_i^n represents the hidden layer vector of the i-th word/character of the n-th layer in the second medical text, 0 ≤ i ≤ n, and n is a natural number.
For example, the probability distribution Prob of the second medical text predicted by the classification model in step S202 is further explained as follows:
the predicting, by the classification model, the probability distribution Prob of the second medical text comprises:
obtaining the first hidden layer vector h_0^n of the n-th layer in the second medical text,
h_0^n = Trm(h_0^(n-1)) (6)
performing a linear transformation C on the hidden layer vector h_0^n of the first word/character of the n-th layer,
C = W·h_0^n + b (7)
and predicting the probability distribution Prob of the second medical text,
Prob = softmax(C) (8)
wherein h_0^n represents the hidden layer vector of the first character (the [CLS] tag) of the n-th layer in the second medical text.
In the second stage, each word/character in the second medical text is classified by the classification model and its probability distribution is predicted; the words/characters in the second medical text are thereby examined, erroneous words/characters are screened out, and the error correction precision is improved.
Exemplarily, taking "novel coronene virus" as an example, sequentially identifying 6 characters, predicting a probability distribution, "new" probability distribution may be "1", "type" probability distribution may be "1", "coronene" probability distribution may be "0", and the like, which are not described herein again. Then the coronene is judged as the character to be corrected.
The above third stage is explained in detail below:
Exemplarily, as shown in fig. 6, which is another flowchart of the medical text correction method provided by an embodiment of the present invention, the correcting of the medical text to be corrected includes:
S301, encoding the medical text to be corrected with the encoder Self_enc,
h_i^n = Trm(h_i^(n-1)), with h_i^0 = v_i (9)
wherein h_i^n represents the hidden layer vector of the i-th word/character of the n-th layer of the medical text to be corrected in the encoder Self_enc, and v_i represents the input vector of the i-th word/character of the medical text to be corrected in the encoder Self_enc;
S302, decoding the encoded medical text to be corrected with the decoder Self_dec,
f_i^n = Trm(f_i^(n-1), h^N), with f_i^0 = u_i (10)
wherein f_i^n represents the hidden layer vector of the i-th word/character of the n-th layer of the medical text to be corrected in the decoder Self_dec, u_i represents the input vector of the i-th word/character of the medical text to be corrected in the decoder Self_dec, and h^N represents the hidden state of the N-th layer of the encoder Self_enc for the medical text to be corrected;
s303, predicting the probability distribution of the text to be corrected to obtain the corrected medical text.
Illustratively, the encoding of the medical text to be corrected with the encoder Self_enc in step S301 is explained in detail below.
The encoding of the medical text to be corrected with the encoder Self_enc includes:
defining a second input sequence L_n,
L_n = [l_0, l_1 ... l_i ... l_n] (11)
and inputting the second input sequence L_n into the pre-trained language model BERT_bio to obtain a second input vector V,
V = [v_0, v_1, v_2 ... v_i ... v_n] (12)
wherein v_i represents the input vector of the i-th word/character of the medical text to be corrected in the encoder Self_enc.
Illustratively, the decoding of the encoded medical text to be corrected with the decoder Self_dec in step S302 is explained in detail below.
The decoding of the encoded medical text to be corrected with the decoder Self_dec includes:
defining a third input sequence Y_n,
Y_n = [y_0, y_1 ... y_i ... y_n] (13)
and inputting the third input sequence Y_n into the pre-trained language model BERT_bio to obtain a third input vector U,
U = [u_0, u_1, u_2 ... u_i ... u_n] (14)
wherein u_i represents the input vector of the i-th word/character of the medical text to be corrected in the decoder Self_dec.
Illustratively, the prediction of the probability distribution of the text to be corrected in step S303 to obtain the corrected medical text is explained in detail below.
The predicting of the probability distribution of the text to be corrected to obtain the corrected medical text includes:
obtaining the hidden state f^N of the N-th layer of the medical text to be corrected in the decoder Self_dec;
performing a linear transformation o_i on the hidden state f^N of the N-th layer,
o_i = W·f_i^N + b (15)
wherein f^N represents the hidden layer vectors of all words/characters of the N-th layer of the medical text to be corrected in the decoder Self_dec, and o_i represents the linear transformation of the i-th word/character of the N-th layer;
predicting the probability distribution prob_i of each word/character of the medical text to be corrected,
prob_i = softmax(W_i·o_i + b_i) (16)
wherein W_i and b_i are parameters of the probability distribution;
selecting the word/character z_i with the maximum probability at each position of the medical text to be corrected,
z_i = argmax(prob_i) (17)
and acquiring the corrected medical text Z according to the maximum-probability word/character at each position,
Z = [z_1, z_2 ... z_i ... z_n] (18)
wherein 0 ≤ i ≤ n, and n is a natural number.
Through the error correction of the third stage, the words/characters in the text to be corrected are corrected. In this medical text error correction method, a pre-trained language model BERT_bio in the medical field is established, the second medical text is identified (which can be understood as establishing a discriminant model), and the medical text to be corrected is corrected (which can be understood as establishing an error correction model); wrong and uncommon characters in the medical field are thus corrected through two different models.
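By way of illustration only, the stage-three corrector can be sketched as a plain Transformer encoder-decoder with greedy argmax decoding, assuming PyTorch. This sketch simplifies the patent's design: it uses randomly initialized embeddings rather than the BERT_bio input vectors of formulas (12) and (14), omits the pointer-network component mentioned above, and all dimensions are assumptions.

import torch
import torch.nn as nn

class Corrector(nn.Module):
    # The encoder plays the role of Self_enc, the decoder of Self_dec,
    # followed by a per-position linear + softmax head (formulas (15)-(16)).
    def __init__(self, vocab_size, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=8,
                                          num_encoder_layers=4,
                                          num_decoder_layers=4,
                                          batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)  # W_i, b_i of formula (16)

    def forward(self, src_ids, tgt_ids):
        v = self.embed(src_ids)     # v_i, encoder input vectors, cf. (12)
        u = self.embed(tgt_ids)     # u_i, decoder input vectors, cf. (14)
        f = self.transformer(v, u)  # f_i^n, decoder hidden states, cf. (10)
        return self.out(f)          # logits; softmax gives prob_i of (16)

@torch.no_grad()
def greedy_correct(model, src_ids, bos_id, max_len):
    # z_i = argmax(prob_i) at each position (formula (17)); the collected
    # characters form the corrected text Z of formula (18).
    tgt = torch.full((src_ids.size(0), 1), bos_id, dtype=torch.long)
    for _ in range(max_len):
        logits = model(src_ids, tgt)
        z_i = logits[:, -1].argmax(dim=-1, keepdim=True)
        tgt = torch.cat([tgt, z_i], dim=1)
    return tgt[:, 1:]  # corrected sequence Z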
In another embodiment, the present application provides a medical text correction device, including:
a training unit for establishing a pre-trained language model BERT_bio in the medical field;
a processing unit for obtaining the medical text to be corrected through the pre-trained language model BERT_bio;
And the correcting unit is used for correcting the medical text to be corrected.
The medical text correction device in this embodiment may include a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, the programs including instructions for performing the steps of any of the above embodiments.
This embodiment can achieve the beneficial effects of the above embodiments by executing the instructions of the steps in those embodiments. In the embodiment of the invention, a pre-trained language model BERT_bio in the medical field is established. The pre-trained language model BERT_bio relies on relatively easily accessible large-scale medical text (the initial corpus) and is fine-tuned using a large amount of external medical data, making it more accurate than a language model that uses only the initial corpus. Thus, obtaining the medical text to be corrected through the pre-trained language model BERT_bio and then correcting it improves the coverage and applicability of medical text correction. Moreover, because the method takes the character as the minimum processing unit, its granularity is finer; compared with the prior art, it does not need word segmentation and avoids the problem that the quality of current word segmentation models is generally low.
In another embodiment, the present application further provides a computer storage medium storing a computer program, where the computer program is executed by a processor to implement some or all of the steps of any medical text correction method described in the above embodiments.
This embodiment can achieve all the advantages of the above embodiments by performing some or all of their steps. In the embodiment of the invention, a pre-trained language model BERT_bio in the medical field is established. The pre-trained language model BERT_bio relies on relatively easily accessible large-scale medical text (the initial corpus) and is fine-tuned using a large amount of external medical data, making it more accurate than a language model that uses only the initial corpus. Thus, obtaining the medical text to be corrected through the pre-trained language model BERT_bio and then correcting it improves the coverage and applicability of medical text correction. Moreover, because the method takes the character as the minimum processing unit, its granularity is finer; compared with the prior art, it does not need word segmentation and avoids the problem that the quality of current word segmentation models is generally low.
Embodiments of the present application also provide a computer program product comprising a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps of any one of the medical text correction methods as recited in the above method embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative; for instance, the division of the units is only one type of division of logical functions, and there may be other divisions in actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the shown or discussed mutual coupling, direct coupling, or communication connection may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical or take other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software program module.
The integrated units, if implemented in the form of software program modules and sold or used as stand-alone products, may be stored in a computer-readable memory. Based on such understanding, the part of the technical solution of the present application that contributes in substance to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method described in the embodiments of the present application. The aforementioned memory includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk, which can store program codes.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable memory, which may include: flash Memory disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (12)

1. A medical text correction method, characterized by comprising:
establishing a pre-trained language model BERT_bio in the medical field;
obtaining a medical text to be corrected through the pre-trained language model BERT_bio;
and correcting the medical text to be corrected.
2. The method of claim 1, wherein the establishing of the pre-trained language model BERT_bio in the medical field includes:
acquiring a first medical text;
identifying and acquiring unlabeled data R_n in the first medical text, and taking the unlabeled data R_n as a second medical text,
R_n = [s_1, s_2 ... s_i ... s_n] (1)
wherein s = [w_0, w_1 ... w_i ... w_n], s represents each sentence of the second medical text, and w represents each word/character of the second medical text;
training the pre-trained language model BERT_bio with the second medical text, the training target of the pre-trained language model BERT_bio being P,
P = (w_i | w_0 ... w_(i-1), w_(i+1) ... w_n) (2)
wherein 0 ≤ i ≤ n, and n is a natural number.
3. The medical text correction method according to claim 2, wherein the obtaining of the medical text to be corrected through the pre-trained language model BERT_bio includes:
establishing a classification model;
predicting a probability distribution Prob of the second medical text by the classification model;
and screening the medical text to be corrected according to the probability distribution Prob.
4. The medical text correction method according to claim 3, wherein the establishing of a classification model includes:
defining a first input sequence X_n and adding the tag [CLS] at the source end of the first input sequence X_n,
X_n = [x_0, x_1 ... x_i ... x_n] (3)
inputting the tagged first input sequence X_n into the pre-trained language model BERT_bio to obtain a first input vector E,
E = [e_0, e_1, e_2 ... e_i ... e_n] (4)
wherein e_i represents the first input vector of the i-th word/character of the second medical text;
encoding each word/character in the second medical text with the Transformer encoder Trm,
h_i^n = Trm(h_i^(n-1)), with h_i^0 = e_i (5)
wherein h_i^n represents the hidden layer vector of the i-th word/character of the n-th layer in the second medical text, 0 ≤ i ≤ n, and n is a natural number.
5. The medical text correction method according to claim 4, wherein the predicting of the probability distribution Prob of the second medical text by the classification model includes:
obtaining the first hidden layer vector h_0^n of the n-th layer in the second medical text,
h_0^n = Trm(h_0^(n-1)) (6)
performing a linear transformation C on the hidden layer vector h_0^n of the first word/character of the n-th layer,
C = W·h_0^n + b (7)
and predicting the probability distribution Prob of the second medical text,
Prob = softmax(C) (8)
wherein h_0^n represents the hidden layer vector of the first character (the [CLS] tag) of the n-th layer in the second medical text.
6. The medical text correction method according to claim 1, wherein the correcting of the medical text to be corrected includes:
encoding the medical text to be corrected with the encoder Self_enc,
h_i^n = Trm(h_i^(n-1)), with h_i^0 = v_i (9)
wherein h_i^n represents the hidden layer vector of the i-th word/character of the n-th layer of the medical text to be corrected in the encoder Self_enc, and v_i represents the input vector of the i-th word/character of the medical text to be corrected in the encoder Self_enc;
decoding the encoded medical text to be corrected with the decoder Self_dec,
f_i^n = Trm(f_i^(n-1), h^N), with f_i^0 = u_i (10)
wherein f_i^n represents the hidden layer vector of the i-th word/character of the n-th layer of the medical text to be corrected in the decoder Self_dec, u_i represents the input vector of the i-th word/character of the medical text to be corrected in the decoder Self_dec, and h^N represents the hidden state of the N-th layer of the encoder Self_enc for the medical text to be corrected;
and predicting the probability distribution of the text to be corrected to obtain the corrected medical text.
7. The method according to claim 6, wherein the encoding of the medical text to be corrected with the encoder Self_enc comprises:
defining a second input sequence L_n,
L_n = [l_0, l_1 ... l_i ... l_n] (11)
and inputting the second input sequence L_n into the pre-trained language model BERT_bio to obtain a second input vector V,
V = [v_0, v_1, v_2 ... v_i ... v_n] (12)
wherein v_i represents the input vector of the i-th word/character of the medical text to be corrected in the encoder Self_enc.
8. The method of claim 6, wherein the decoding of the encoded medical text to be corrected with the decoder Self_dec comprises:
defining a third input sequence Y_n,
Y_n = [y_0, y_1 ... y_i ... y_n] (13)
and inputting the third input sequence Y_n into the pre-trained language model BERT_bio to obtain a third input vector U,
U = [u_0, u_1, u_2 ... u_i ... u_n] (14)
wherein u_i represents the input vector of the i-th word/character of the medical text to be corrected in the decoder Self_dec.
9. The method according to claim 6, wherein the predicting of the probability distribution of the text to be corrected to obtain the corrected medical text comprises:
obtaining the hidden state f^N of the N-th layer of the medical text to be corrected in the decoder Self_dec;
performing a linear transformation o_i on the hidden state f^N of the N-th layer,
o_i = W·f_i^N + b (15)
wherein f^N represents the hidden layer vectors of all words/characters of the N-th layer of the medical text to be corrected in the decoder Self_dec, and o_i represents the linear transformation of the i-th word/character of the N-th layer;
predicting the probability distribution prob_i of each word/character of the medical text to be corrected,
prob_i = softmax(W_i·o_i + b_i) (16)
wherein W_i and b_i are parameters of the probability distribution;
selecting the word/character z_i with the maximum probability at each position of the medical text to be corrected,
z_i = argmax(prob_i) (17)
and acquiring the corrected medical text Z according to the maximum-probability word/character at each position,
Z = [z_1, z_2 ... z_i ... z_n] (18)
wherein 0 ≤ i ≤ n, and n is a natural number.
10. A medical text correction apparatus, characterized in that the medical text correction apparatus comprises:
a training unit for establishing a pre-trained language model BERT_bio in the medical field;
a processing unit for obtaining the medical text to be corrected through the pre-trained language model BERT_bio;
and the correcting unit is used for correcting the medical text to be corrected.
11. A medical text correction device, comprising a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, the programs comprising instructions for performing the steps in the method of any of claims 1-9.
12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which is executed by a processor to implement the method according to any one of claims 1-9.
CN202110264865.9A 2021-03-12 2021-03-12 Medical text error correction method, device and storage medium Pending CN112861519A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110264865.9A CN112861519A (en) 2021-03-12 2021-03-12 Medical text error correction method, device and storage medium


Publications (1)

Publication Number Publication Date
CN112861519A true CN112861519A (en) 2021-05-28

Family

ID=75994052

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110264865.9A Pending CN112861519A (en) 2021-03-12 2021-03-12 Medical text error correction method, device and storage medium

Country Status (1)

Country Link
CN (1) CN112861519A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113191119A (en) * 2021-06-02 2021-07-30 云知声智能科技股份有限公司 Method, apparatus and storage medium for training text error correction model

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180349327A1 (en) * 2017-06-05 2018-12-06 Baidu Online Network Technology (Beijing)Co., Ltd. Text error correction method and apparatus based on recurrent neural network of artificial intelligence
CN111259625A (en) * 2020-01-16 2020-06-09 平安科技(深圳)有限公司 Intention recognition method, device, equipment and computer readable storage medium
CN112002323A (en) * 2020-08-24 2020-11-27 平安科技(深圳)有限公司 Voice data processing method and device, computer equipment and storage medium
CN112016310A (en) * 2020-09-03 2020-12-01 平安科技(深圳)有限公司 Text error correction method, system, device and readable storage medium


Similar Documents

Publication Publication Date Title
CN110717039B (en) Text classification method and apparatus, electronic device, and computer-readable storage medium
CN107220235B (en) Speech recognition error correction method and device based on artificial intelligence and storage medium
CN107293296B (en) Voice recognition result correction method, device, equipment and storage medium
CN107273356B (en) Artificial intelligence based word segmentation method, device, server and storage medium
CN110795938B (en) Text sequence word segmentation method, device and storage medium
CN108090043B (en) Error correction report processing method and device based on artificial intelligence and readable medium
CN110222330B (en) Semantic recognition method and device, storage medium and computer equipment
CN111814479B (en) Method and device for generating enterprise abbreviations and training model thereof
CN111897954A (en) User comment aspect mining system, method and storage medium
CN111597807B (en) Word segmentation data set generation method, device, equipment and storage medium thereof
CN111046659A (en) Context information generating method, context information generating device, and computer-readable recording medium
CN115810068A (en) Image description generation method and device, storage medium and electronic equipment
CN116303537A (en) Data query method and device, electronic equipment and storage medium
CN113435499B (en) Label classification method, device, electronic equipment and storage medium
CN114780701A (en) Automatic question-answer matching method, device, computer equipment and storage medium
CN112861519A (en) Medical text error correction method, device and storage medium
CN113903420A (en) Semantic label determination model construction method and medical record analysis method
CN112163434B (en) Text translation method, device, medium and electronic equipment based on artificial intelligence
CN112270184A (en) Natural language processing method, device and storage medium
CN113011531A (en) Classification model training method and device, terminal equipment and storage medium
CN113051894A (en) Text error correction method and device
CN109670040B (en) Writing assistance method and device, storage medium and computer equipment
CN112016281B (en) Method and device for generating wrong medical text and storage medium
CN115796141A (en) Text data enhancement method and device, electronic equipment and storage medium
CN112988996B (en) Knowledge base generation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination