CN111931490A - Text error correction method, device and storage medium - Google Patents

Text error correction method, device and storage medium

Info

Publication number
CN111931490A
CN111931490A (application CN202011030582.XA)
Authority
CN
China
Prior art keywords
word
corrected
text
text sequence
replaced
Prior art date
Legal status
Granted
Application number
CN202011030582.XA
Other languages
Chinese (zh)
Other versions
CN111931490B (en)
Inventor
郭招
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202011030582.XA
Publication of CN111931490A
Application granted
Publication of CN111931490B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools

Abstract

The application relates to the technical field of medical science and technology, and particularly discloses a text error correction method, apparatus, and storage medium. The method comprises the following steps: acquiring a text to be corrected in the medical field; inputting the text to be corrected in the medical field into a fine-tuned network model corresponding to the medical field, and determining the word to be corrected in the text to be corrected; and correcting the word to be corrected through the network model to obtain an error-corrected text. The method and apparatus help improve text error correction accuracy.

Description

Text error correction method, device and storage medium
Technical Field
The present application relates to the field of text recognition technologies, and in particular, to a text error correction method, apparatus, and storage medium.
Background
Text error correction is a basic task of Natural Language Processing (NLP) and a basic module underlying functions such as search engines, speech recognition, and content review. For example, in the medical field, text error correction allows the historical cases that doctors need to be retrieved quickly, improving diagnostic efficiency.
The current text error correction pipeline generally comprises error detection and error correction. Error detection is implemented by one model and error correction by another. Because the two models are trained independently, they must coordinate during error correction; since their training scenarios or objectives differ, it is difficult for the two models to output an optimal solution, so the accuracy of text error correction is low.
Disclosure of Invention
The embodiments of the application provide a text error correction method, apparatus, and storage medium. Error detection and error correction are both performed on the text to be corrected by a single fine-tuned network model, so no coordination among models is needed, which improves text error correction accuracy.
In a first aspect, an embodiment of the present application provides a text error correction method, including:
acquiring a text to be corrected in the medical field;
inputting the text to be corrected in the medical field into a fine-tuned network model corresponding to the medical field, and determining the word to be corrected in the text to be corrected;
and correcting the word to be corrected through the network model to obtain a text after error correction.
In a second aspect, an embodiment of the present application provides a text error correction apparatus, including:
the acquiring unit is used for acquiring a text to be corrected in the medical field;
the processing unit is used for inputting the text to be corrected in the medical field into a fine-tuned network model corresponding to the medical field, and determining the word to be corrected in the text to be corrected;
and the processing unit is also used for correcting the word to be corrected through the network model to obtain a text after error correction.
In a third aspect, embodiments of the present application provide a text correction device, including a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, and the program includes instructions for executing steps in the method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, which stores a computer program, where the computer program makes a computer execute the method according to the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program product comprising a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to perform the method according to the first aspect.
The embodiment of the application has the following beneficial effects:
it can be seen that, in the embodiment of the application, through a network model corresponding to the medical field after fine tuning, error detection and correction can be performed on the text to be corrected in the medical field. Therefore, coordination among models is not needed in the error correction process, and error correction can be completed by using one model, so that the text error correction precision is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application; other drawings can be obtained by those skilled in the art based on these drawings without creative effort.
Fig. 1 is a schematic flowchart of a text error correction method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a network model according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a network model training method according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of a text error correction apparatus according to an embodiment of the present application;
fig. 5 is a block diagram illustrating functional units of a text error correction apparatus according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by those skilled in the art that the embodiments described herein can be combined with other embodiments.
Referring to fig. 1, fig. 1 is a schematic flow chart of a text error correction method according to an embodiment of the present application. The method is applied to a text error correction device. The method comprises the following steps:
101: the text correction device acquires a text to be corrected in the medical field.
The text to be corrected may be text input by a user, for example, text entered in a search box or in a dialog box.
102: The text error correction device inputs the text to be corrected in the medical field into a fine-tuned network model corresponding to the medical field, and determines the word to be corrected in the text to be corrected.
The network model is a model fine-tuned in advance for the medical field; its training process is described in detail later and is not repeated here.
Illustratively, each word in the text to be corrected can be encoded through the network model to obtain a word vector corresponding to each word, where the encoding can be implemented with BERT. A score corresponding to each word is then determined from its word vector, for example by inputting each word vector into a fully connected layer of the network model for classification. Finally, the words to be corrected are determined according to the scores, for example by taking a word whose score is smaller than a first threshold as a word to be corrected.
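Illustratively, this detection step can be sketched as follows. The sketch is a minimal assumption-laden illustration, not the embodiment itself: the encoder below is a stand-in for a pretrained BERT, and the vocabulary size, hidden dimension, and threshold of 0.5 are all illustrative choices.

```python
import torch
import torch.nn as nn

class ErrorDetector(nn.Module):
    """Scores each word of a text; low scores flag words to be corrected."""
    def __init__(self, vocab_size=21128, hidden=256):
        super().__init__()
        # Stand-in for a pretrained BERT encoder: an embedding plus one
        # self-attention layer (a real system would load BERT weights).
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.scorer = nn.Linear(hidden, 1)   # fully connected scoring layer

    def forward(self, token_ids):            # token_ids: (batch, seq_len)
        h = self.encoder(self.embed(token_ids))           # word vectors h1..hn
        return torch.sigmoid(self.scorer(h)).squeeze(-1)  # one score per word

detector = ErrorDetector()
ids = torch.randint(0, 21128, (1, 6))        # a toy six-word input
scores = detector(ids)                       # shape (1, 6)
to_correct = (scores < 0.5).nonzero()        # words below the first threshold
```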
In one embodiment of the present application, each word may be encoded to obtain a word vector corresponding to each word; and then, fusing the word vectors corresponding to each word based on the self-attention mechanism of the network model to obtain the target feature vector corresponding to each word.
103: The text error correction device corrects the word to be corrected through the network model to obtain an error-corrected text.
For example, at least one candidate word to be corrected corresponding to the word to be corrected, such as near-form (visually similar) words and near-sound words, may be obtained from the dictionary library. Each candidate word is then used in turn to replace the word to be corrected in the text to be corrected, so as to obtain a new text. Finally, each word in the new text is encoded through the network model to obtain the score of each candidate word, and the candidate word with the highest score replaces the word to be corrected, yielding the corrected text.
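This replace-and-rescore loop can be sketched as below; `get_candidates` (dictionary lookup of near-form/near-sound words) and `sentence_score` (the network model's plausibility score for a whole text) are hypothetical helper names introduced here for illustration, not names from the embodiment.

```python
def correct_word(words, pos, get_candidates, sentence_score):
    """Try each candidate at position `pos`; keep the highest-scoring text.

    `words` is the text to be corrected as a list of tokens.
    """
    best_text, best_score = words, float("-inf")
    for cand in get_candidates(words[pos]):        # near-form/near-sound words
        new_text = words[:pos] + [cand] + words[pos + 1:]
        score = sentence_score(new_text)           # network model's score
        if score > best_score:
            best_text, best_score = new_text, score
    return best_text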
It can be seen that, in the embodiment of the application, through a network model corresponding to the medical field after fine tuning, error detection and correction can be performed on the text to be corrected in the medical field. Therefore, coordination among models is not needed in the error correction process, and error correction can be completed by using one model, so that the text error correction precision is improved.
In an embodiment of the application, at least one candidate word to be corrected corresponding to the word to be corrected can further be obtained from the dictionary library, and the entity type of each candidate word is obtained. For example, if the text to be corrected is "二甲双瓜片", a misspelling of the drug name "二甲双胍片" (metformin tablets) in which the character "胍" (guanidine) was mistyped as the similar character "瓜" (melon), then the word to be corrected is "瓜", and the corresponding candidate words include "胍", "褂" (jacket), and the like; the entity type of the candidate word "胍" is a medical entity. Each candidate word and its entity type are encoded through the network model to obtain an encoding vector corresponding to each candidate word. The word vector corresponding to each word in the text to be corrected and the encoding vector corresponding to each candidate word are fused through the network model to obtain the target feature vector of each candidate word. A score corresponding to each candidate word is obtained according to its target feature vector, where the score expresses how reasonable the text is after the candidate word replaces the word to be corrected. Finally, the candidate word with the highest score replaces the word to be corrected in the text to be corrected to obtain the error-corrected text.
For example, as shown in fig. 2, each candidate word to be corrected and its entity type may be encoded through the network model. Each candidate word may be encoded, for example by word embedding, to obtain a first word vector; the entity type corresponding to each candidate word is encoded to obtain a second word vector; finally, the first word vector and the second word vector of each candidate word are superposed bitwise (elementwise) to obtain the encoding vector corresponding to each candidate word. In addition, each word in the text to be corrected is encoded through the text encoding layer of the network model to obtain the word vector corresponding to each word.
Furthermore, the encoding vector corresponding to each candidate word to be corrected is fused with the word vector corresponding to each word to obtain the target feature vector corresponding to each candidate word. Illustratively, in a manner similar to the self-attention mechanism, the encoding vector of each candidate word serves as the query vector and the word vectors serve as key-value pairs. The similarity between the encoding vector of each candidate word and each word vector is determined and normalized to obtain the weight between the candidate word and each word; finally, the word vectors are weighted by these weights and summed to obtain the target feature vector corresponding to the candidate word.
For example, the target feature vector corresponding to each candidate word to be corrected can be represented by formula (1):
\beta_i = \sum_{j=1}^{n} \mathrm{softmax}\big(\mathrm{dist}(z_i, x_j)\big) \cdot x_j \quad (1)

where β_i is the target feature vector corresponding to the i-th candidate word in the at least one candidate word, i ranges from 1 to m, and m is the number of candidate words; softmax is the normalization operation; dist is the distance (similarity) operation; z_i is the encoding vector corresponding to the i-th candidate word; x_j is the word vector of the j-th word in the text to be corrected, j ranges from 1 to n, and n is the number of words in the text to be corrected.
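Illustratively, the encoding and the fusion of formula (1) can be sketched as follows, taking a dot product as the dist similarity. The vocabulary size, the number of entity types, and the random inputs are illustrative assumptions rather than values from the embodiment.

```python
import torch
import torch.nn as nn

hidden = 256
word_emb = nn.Embedding(21128, hidden)  # first word vector (word embedding)
type_emb = nn.Embedding(16, hidden)     # second word vector (entity type)

def encode_candidates(cand_ids, type_ids):
    # z_i: bitwise (elementwise) superposition of the two vectors
    return word_emb(cand_ids) + type_emb(type_ids)       # (m, hidden)

def fuse(z, x):
    """Formula (1): beta_i = sum_j softmax_j(dist(z_i, x_j)) * x_j."""
    sim = z @ x.T                          # dist() taken as a dot product
    weights = torch.softmax(sim, dim=-1)   # weight of candidate i on word j
    return weights @ x                     # target feature vectors (m, hidden)

x = torch.randn(6, hidden)                 # word vectors of the text, n = 6
z = encode_candidates(torch.tensor([10, 20]), torch.tensor([1, 0]))
beta = fuse(z, x)                          # one target vector per candidate
scores = nn.Linear(hidden, 1)(beta)        # fully connected scoring layer
```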
Finally, the target feature vector corresponding to each candidate word to be corrected is input into the fully connected layer for scoring and classification to obtain the score corresponding to each candidate word; this scoring process is similar to a general classification process and is not repeated here. The candidate word with the highest score then replaces the word to be corrected in the text to be corrected, yielding the error-corrected text.
It can be seen that, in the embodiment of the present application, the encoding vector of each candidate word to be corrected is fused with each word in the text to be corrected. If a candidate word is reasonable, its similarity to the other words (the words except the word to be corrected) is relatively high, so the fused target feature vector can represent the semantic features that the text to be corrected intends to express and receives a relatively high score in the subsequent scoring; the score is therefore more convincing after fusion. Moreover, the entity type of each candidate word is fused into its encoding vector, so when computing similarity, whether each candidate word matches the text to be corrected is also judged from the perspective of entity type. This makes the similarity computation, and hence the subsequent scoring, more accurate: candidate words that fit the text semantics but have the wrong entity type are excluded, which improves error correction accuracy.
In an embodiment of the present application, the text error correction method may also be applied to the field of smart medical technology. For example, an erroneous text input by a doctor is corrected, and the corrected text is used for case retrieval, so that the correct cases can be found and provided for the doctor's diagnosis, improving diagnostic accuracy and promoting the development of medical technology.
The following describes a process of performing text error correction with reference to a schematic structural diagram of a network model shown in fig. 2.
As shown in fig. 2, a text to be corrected [x1, x2, x3, ..., xn] is input into the network model, and each word in the text is encoded through the network model to obtain the word vectors [h1, h2, h3, ..., hn]. The words to be corrected are then determined according to the word vectors [h1, h2, h3, ..., hn]; illustratively, the word vector of each word is input into the fully connected layer to obtain the score of each word, and the words whose scores are smaller than a threshold are taken as the words to be corrected;
then, at least one candidate word [y1, y2, y3, ..., ym] corresponding to the word to be corrected is obtained; each candidate word and its entity type are encoded to obtain the encoding vectors [z1, z2, z3, ..., zm]; the encoding vectors [z1, z2, z3, ..., zm] are fused with the word vectors [h1, h2, h3, ..., hn] to obtain the target feature vectors [β1, β2, β3, ..., βm] of the candidate words; finally, the target feature vectors [β1, β2, β3, ..., βm] are input into the fully connected layer to determine the scores [score1, score2, score3, ..., scorem] of the candidate words, and the candidate word with the highest score replaces the word to be corrected to obtain the corrected text.
Referring to fig. 3, fig. 3 is a schematic flowchart of a network model training method according to an embodiment of the present disclosure. The method comprises the following steps:
301: a first text sequence is obtained.
Wherein, the first text sequence is the original text sequence and is a correct text sequence. The first text sequence may be a text sequence in the medical field, or may not be a text sequence in the medical field, which is not limited in this application.
302: determining words to be replaced in the first text sequence, wherein the words to be replaced are partial words in the first text sequence.
For example, a random sampling rate may be generated by a random function, and the first text sequence is sampled at that rate to determine the words to be replaced; if the random sampling rate is 0.5, 50% of the words in the first text sequence are taken as words to be replaced. Determining the words to be replaced through a random sampling rate simulates the fact that, in daily life, the probability of any given word in a text being mistyped is random; this ensures the corpus richness of the replaced text sequences and gives the trained network model strong generalization capability.
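A minimal sketch of this sampling step follows; the helper name is introduced here for illustration, and `random.random()` stands in for whatever random function generates the sampling rate.

```python
import random

def pick_words_to_replace(sequence):
    rate = random.random()               # random sampling rate in [0, 1)
    k = int(len(sequence) * rate)        # e.g. rate 0.5 -> 50% of the words
    return random.sample(range(len(sequence)), k)   # positions to replace
```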
303: and replacing the words to be replaced to obtain at least one second text sequence corresponding to the first text sequence.
Illustratively, the words to be replaced comprise first words to be replaced and second words to be replaced, where the first words are one portion of the words to be replaced, the second words are another portion, and some of the words to be replaced may remain unreplaced. For example, 30% of the words to be replaced can be taken as first words to be replaced, 40% as second words to be replaced, and the remaining words left unreplaced.
For example, at least one first candidate word may be randomly obtained from a dictionary library, and at least one second candidate word corresponding to the second word to be replaced may be obtained from the dictionary library, where each second candidate word is one of the following: a homophone, a near-sound word, a near-form (visually similar) word, or a word-order-reversed variant of the second word to be replaced. The first word to be replaced is then replaced with each first candidate word in the at least one first candidate word, and the second word to be replaced with each second candidate word, to obtain at least one second text sequence corresponding to the first text sequence.
Illustratively, through the above replacement, one first text sequence is generalized into a plurality of second text sequences: if q first candidate words are randomly selected and w second candidate words corresponding to the second word to be replaced are determined, the first text sequence can be generalized into q × w second text sequences, as sketched below.
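The sketch uses one first and one second word to be replaced for brevity; `random_words(q)` (q random dictionary words) and `confusion_set(word)` (homophone, near-sound, near-form, and order-reversed candidates) are hypothetical helpers, not names from the embodiment.

```python
import itertools

def generalize(seq, first_pos, second_pos, random_words, confusion_set, q=3):
    """Build q * w second text sequences from one first text sequence."""
    first_cands = random_words(q)                  # q random dictionary words
    second_cands = confusion_set(seq[second_pos])  # w similar-word candidates
    out = []
    for c1, c2 in itertools.product(first_cands, second_cands):
        new = list(seq)
        new[first_pos], new[second_pos] = c1, c2   # replace both words
        out.append(new)
    return out                                     # len(out) == q * w
```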
It can be understood that randomly replacing the first word to be replaced simulates errors in daily life that are random, where the wrong word has no association with the original correct word, which increases the richness of the second text sequence corpus; replacing the second word to be replaced with a second candidate word simulates the common situation of mistyping a character as a similar character, further enriching the corpus. A network model trained with such second text sequences therefore generalizes better.
304: and taking each second text sequence in the at least one second text sequence as a training sample, and training the network model to obtain a pre-training model.
Exemplarily, each second text sequence is input into the network model as a training sample to obtain a prediction result for each word in the second text sequence together with a third text sequence obtained by correcting the second text sequence, where the prediction result of each word indicates whether the word is predicted to have been replaced. A first loss is then obtained from the prediction result and the real result of each word, where the real result, labeled in advance, indicates whether the word was actually replaced. For example, if the first text sequence is "metformin tablets" and the second text sequence replaces its fourth word, the labeling result of the second text sequence is (0, 0, 0, 1, 0), where 0 means the corresponding word was not replaced and 1 means it was. The first loss can therefore be expressed by formula (2):
L_1 = \sum_{i=1}^{N} \mathrm{Cross\_Entropy}(P_i, \hat{P}_i) \quad (2)

where L_1 is the first loss, N is the number of words in the second text sequence, Cross_Entropy is the cross-entropy loss function, P_i is the real result of the i-th word of the N words, and \hat{P}_i is the prediction result of the i-th word.
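Illustratively, formula (2) can be computed by treating the replaced/not-replaced prediction for each word as binary classification; the toy logits below are random stand-ins, and the five-word example follows the (0, 0, 0, 1, 0) labeling above.

```python
import torch
import torch.nn.functional as F

labels = torch.tensor([0, 0, 0, 1, 0])   # real results P_i: word 4 replaced
logits = torch.randn(5, 2)               # model outputs, one pair per word

# Formula (2): L1 = sum over the N words of Cross_Entropy(P_i, P_i_hat)
first_loss = F.cross_entropy(logits, labels, reduction="sum")
```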
Then, a second loss is obtained from the third text sequence and the first text sequence; that is, the first text sequence serves as the supervision label of the third text sequence, and the difference between the third text sequence and the first text sequence is taken as the second loss.
Finally, the network parameters of the network model are adjusted according to the first loss and the second loss. For example, the two losses are weighted to obtain a first target loss, and the network parameters are adjusted by gradient descent according to the first target loss until the network model converges, yielding the pre-training model.
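A sketch of this parameter update follows, under the assumption that the two losses are combined by a simple weighted sum; the weight alpha is an assumption, as the embodiment only states that the losses are weighted.

```python
def training_step(optimizer, first_loss, second_loss, alpha=0.5):
    """One gradient-descent step on the weighted first target loss."""
    target_loss = alpha * first_loss + (1 - alpha) * second_loss
    optimizer.zero_grad()
    target_loss.backward()    # backpropagate through the network model
    optimizer.step()          # adjust the network parameters
    return target_loss.item()
```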
It can be understood that, in the medical field, erroneous medical texts are scarce and medical entities are specialized, so it is difficult to construct second text sequences with a rich corpus. Therefore, a pre-training model is first trained with a large number of training samples from other fields and then fine-tuned, which yields a network model with relatively high error correction accuracy.
305: and fine-tuning the pre-training model to obtain a network model corresponding to the medical field after fine tuning.
Exemplarily, a fourth text sequence of the medical field is obtained, the entity in the fourth text sequence is determined, and at least one candidate entity corresponding to that entity is obtained from a pre-constructed entity dictionary library of the medical field; the entity in the fourth text sequence is then replaced with each candidate entity to obtain at least one fifth text sequence. For example, if the fourth text sequence is "I want to take metformin hydrochloride tablets", the entity is determined to be "metformin hydrochloride tablets"; the candidate entities obtained from the entity dictionary library are "reserpine" and "metformin hydrochloride tablets", and entity replacement yields the fifth text sequences "I want to take reserpine" and "I want to take metformin hydrochloride tablets". Finally, the pre-training model is fine-tuned with each fifth text sequence in the at least one fifth text sequence to obtain the fine-tuned network model corresponding to the medical field.
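A minimal sketch of building the fifth text sequences; `find_entity` (entity recognition over the fourth text sequence) and `entity_dict` (the pre-constructed medical entity dictionary library) are hypothetical stand-ins introduced for illustration.

```python
def build_fifth_sequences(fourth_text, find_entity, entity_dict):
    """Swap the recognized entity for each candidate entity."""
    entity = find_entity(fourth_text)      # e.g. "metformin hydrochloride tablets"
    return [fourth_text.replace(entity, cand)   # one fifth sequence each
            for cand in entity_dict[entity]]    # e.g. ["reserpine", ...]
```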
For example, the word to be replaced in each fifth text sequence may be determined according to the manner of determining the word to be replaced in step 302, and then the word to be replaced in the fifth text sequence is replaced according to the replacement manner in step 303, so as to obtain at least one sixth text sequence corresponding to each fifth text sequence.
Similarly, each word in the sixth text sequence is labeled with a real result of whether the word is replaced or not in advance; then, inputting each sixth text sequence as a training sample into the pre-training model to obtain a prediction result of whether each word in the sixth text sequence is replaced and an error-corrected text sequence; determining a third loss according to the predicted result and the real result of whether each word is replaced; obtaining a fourth loss according to the corrected text sequence and the supervision tag (i.e. a fifth text sequence) corresponding to each sixth text sequence; and finally, carrying out weighting processing on the third loss and the fourth loss to obtain a second target loss, and carrying out fine adjustment on the pre-training model according to the second target loss and a gradient descent method until the pre-training model is converged to obtain a fine-adjusted network model corresponding to the medical field.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a text error correction apparatus according to an embodiment of the present application. As shown in fig. 4, the text correction device 400 includes a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, the programs including instructions for performing the steps of:
acquiring a text to be corrected in the medical field;
inputting the text to be corrected in the medical field into a fine-tuned network model corresponding to the medical field, and determining the word to be corrected in the text to be corrected;
and correcting the word to be corrected through the network model to obtain a text after error correction.
In some possible embodiments, in terms of correcting the word to be corrected through the network model to obtain a corrected text, the program is specifically configured to execute the following instructions:
acquiring at least one candidate word to be corrected corresponding to the word to be corrected;
obtaining an entity type corresponding to each candidate word to be corrected in the at least one candidate word to be corrected, and coding each candidate word to be corrected and the entity type corresponding to each candidate word to be corrected through the network model to obtain a coding vector corresponding to each candidate word to be corrected;
performing fusion processing on a word vector corresponding to each word in the text to be corrected and a coding vector corresponding to each candidate word to be corrected through the network model to obtain a target characteristic vector of each candidate word to be corrected;
obtaining a score corresponding to each candidate word to be corrected according to the target feature vector of each candidate word to be corrected, wherein the score is used for representing the reasonableness of the text after the word to be corrected in the text to be corrected is replaced by using each candidate word to be corrected;
and replacing the word to be corrected in the text to be corrected by using the candidate word to be corrected with the largest score to obtain the text after error correction.
In some possible embodiments, before the text to be corrected is acquired, the program further includes instructions for executing the following steps:
acquiring a first text sequence;
determining words to be replaced in the first text sequence, wherein the words to be replaced are partial words in the first text sequence;
replacing the word to be replaced to obtain at least one second text sequence corresponding to the first text sequence;
taking each second text sequence in the at least one second text sequence as a training sample, and training the network model to obtain a pre-training model;
and fine-tuning the pre-training model to obtain a network model corresponding to the medical field after fine tuning.
In some possible embodiments, in determining the word to be replaced in the first text sequence, the program is specifically adapted to execute the following steps:
generating a random sampling rate by a random function;
and sampling the first text sequence according to the random sampling rate to obtain words to be replaced in the first text sequence.
In some possible embodiments, the word to be replaced includes a first word to be replaced and a second word to be replaced, and in terms of replacing the word to be replaced to obtain at least one second text sequence corresponding to the first text sequence, the above program is specifically configured to execute the following steps:
randomly acquiring at least one first candidate word from a dictionary library;
obtaining at least one second candidate word corresponding to the second word to be replaced from the dictionary database, wherein each second candidate word in the at least one second candidate word is one of the following: a homophone, a near-sound word, a near-form (visually similar) word, or a word-order-reversed variant of the second word to be replaced;
and replacing the first word to be replaced with each first candidate word in the at least one first candidate word and replacing the second word to be replaced with each second candidate word, to obtain at least one second text sequence corresponding to the first text sequence.
In some possible embodiments, in regard to training the network model by using each of the at least one second text sequence as a training sample to obtain a pre-training model, the above program is specifically configured to execute the following steps:
inputting each second text sequence into the network model to obtain a prediction result for each word in the second text sequence and a third text sequence obtained by correcting the second text sequence, wherein the prediction result of each word indicates whether the word is predicted to have been replaced;
obtaining a first loss according to the prediction result and the real result of each word, wherein the real result of each word is used for indicating whether each word which is labeled in advance is replaced or not;
obtaining a second loss according to the third text sequence and the first text sequence;
and adjusting network parameters of the network model according to the first loss and the second loss so as to train the network model to obtain a pre-training model.
In some possible embodiments, in terms of fine-tuning the pre-training model to obtain a network model corresponding to the medical field after fine-tuning, the program is specifically configured to execute the following instructions:
acquiring a fourth text sequence of the medical field;
determining entities in the fourth text sequence, and acquiring at least one candidate entity corresponding to the entities from a pre-constructed entity dictionary library corresponding to the medical field;
replacing the entities in the fourth text sequence by using each candidate entity in the at least one candidate entity to obtain at least one fifth text sequence;
and respectively using each fifth text sequence in the at least one fifth text sequence to finely adjust the pre-training model to obtain a finely adjusted network model corresponding to the medical field.
Referring to fig. 5, fig. 5 is a block diagram illustrating functional units of a text error correction apparatus according to an embodiment of the present application. The text correction apparatus 500 includes: an acquisition unit 501 and a processing unit 502, wherein:
an obtaining unit 501, configured to obtain a text to be corrected in a medical field;
the processing unit 502 is configured to input the text to be corrected in the medical field into a fine-tuned network model corresponding to the medical field, and determine the word to be corrected in the text to be corrected;
the processing unit 502 is further configured to correct the word to be corrected through the network model to obtain a text after error correction.
In some possible embodiments, in terms of performing error correction on the word to be corrected through the network model to obtain an error-corrected text, the processing unit 502 is specifically configured to:
acquiring at least one candidate word to be corrected corresponding to the word to be corrected;
obtaining an entity type corresponding to each candidate word to be corrected in the at least one candidate word to be corrected, and coding each candidate word to be corrected and the entity type corresponding to each candidate word to be corrected through the network model to obtain a coding vector corresponding to each candidate word to be corrected;
performing fusion processing on a word vector corresponding to each word in the text to be corrected and a coding vector corresponding to each candidate word to be corrected through the network model to obtain a target characteristic vector of each candidate word to be corrected;
obtaining a score corresponding to each candidate word to be corrected according to the target feature vector of each candidate word to be corrected, wherein the score is used for representing the reasonableness of the text after the word to be corrected in the text to be corrected is replaced by using each candidate word to be corrected;
and replacing the word to be corrected in the text to be corrected by using the candidate word to be corrected with the largest score to obtain the text after error correction.
In some possible embodiments, before acquiring the text to be corrected, the acquiring unit 501 is further configured to: acquiring a first text sequence;
the processing unit 502 is further configured to determine a word to be replaced in the first text sequence, where the word to be replaced is a partial word in the first text sequence;
replacing the word to be replaced to obtain at least one second text sequence corresponding to the first text sequence;
taking each second text sequence in the at least one second text sequence as a training sample, and training the network model to obtain a pre-training model;
and fine-tuning the pre-training model to obtain a network model corresponding to the medical field after fine tuning.
In some possible embodiments, in determining the word to be replaced in the first text sequence, the processing unit 502 is specifically configured to:
generating a random sampling rate by a random function;
and sampling the first text sequence according to the random sampling rate to obtain words to be replaced in the first text sequence.
In some possible embodiments, the words to be replaced include a first word to be replaced and a second word to be replaced, and in terms of replacing the words to be replaced to obtain at least one second text sequence corresponding to the first text sequence, the processing unit 502 is specifically configured to:
randomly acquiring at least one first candidate word from a dictionary library;
obtaining at least one second candidate word corresponding to the second word to be replaced from the dictionary database, wherein each second candidate word in the at least one second candidate word is one of the following: a homophone, a near-sound word, a near-form (visually similar) word, or a word-order-reversed variant of the second word to be replaced;
and replacing the first word to be replaced with each first candidate word in the at least one first candidate word and replacing the second word to be replaced with each second candidate word, to obtain at least one second text sequence corresponding to the first text sequence.
In some possible embodiments, in regard to training the network model by using each of the at least one second text sequence as a training sample to obtain a pre-training model, the processing unit 502 is specifically configured to:
inputting each second text sequence into the network model to obtain a prediction result for each word in the second text sequence and a third text sequence obtained by correcting the second text sequence, wherein the prediction result of each word indicates whether the word is predicted to have been replaced;
obtaining a first loss according to the prediction result and the real result of each word, wherein the real result of each word is used for indicating whether each word which is labeled in advance is replaced or not;
obtaining a second loss according to the third text sequence and the first text sequence;
and adjusting network parameters of the network model according to the first loss and the second loss so as to train the network model to obtain a pre-training model.
In some possible embodiments, in terms of performing fine adjustment on the pre-training model to obtain a network model corresponding to the medical field after the fine adjustment, the processing unit 502 is specifically configured to:
acquiring a fourth text sequence of the medical field;
determining entities in the fourth text sequence, and acquiring at least one candidate entity corresponding to the entities from a pre-constructed entity dictionary library corresponding to the medical field;
replacing the entities in the fourth text sequence by using each candidate entity in the at least one candidate entity to obtain at least one fifth text sequence;
and respectively using each fifth text sequence in the at least one fifth text sequence to finely adjust the pre-training model to obtain a finely adjusted network model corresponding to the medical field.
Embodiments of the present application also provide a computer storage medium, which stores a computer program, where the computer program is executed by a processor to implement part or all of the steps of any one of the text error correction methods as described in the above method embodiments.
Embodiments of the present application also provide a computer program product comprising a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps of any one of the text error correction methods as recited in the above method embodiments.
The text error correction device in the application may be a smartphone (such as an Android phone, an iOS phone, or a Windows Phone), a tablet computer, a palmtop computer, a notebook computer, a Mobile Internet Device (MID), a wearable device, or the like. These examples are illustrative rather than exhaustive; in practical applications, the text error correction device may also be an intelligent vehicle-mounted terminal, a computer device, or the like.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that the acts and modules referred to are not necessarily required in this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices or units, and may be an electric or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software program module.
The integrated units, if implemented in the form of software program modules and sold or used as stand-alone products, may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, in whole or in part, may be embodied in the form of a software product stored in a memory and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method described in the embodiments of the present application. The aforementioned memory includes various media capable of storing program code, such as a USB flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, or a magnetic or optical disk.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by hardware instructed by a program, and the program may be stored in a computer-readable memory, which may include: a flash memory disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the method and the core concept of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (10)

1. A text error correction method, comprising:
acquiring a text to be corrected in the medical field;
inputting the text to be corrected in the medical field into a fine-tuned network model corresponding to the medical field, and determining the word to be corrected in the text to be corrected;
and correcting the word to be corrected through the network model to obtain a text after error correction.
2. The method of claim 1, wherein the correcting the word to be corrected through the network model to obtain a corrected text comprises:
acquiring at least one candidate word to be corrected corresponding to the word to be corrected;
obtaining an entity type corresponding to each candidate word to be corrected in the at least one candidate word to be corrected, and coding each candidate word to be corrected and the entity type corresponding to each candidate word to be corrected through the network model to obtain a coding vector corresponding to each candidate word to be corrected;
performing fusion processing on a word vector corresponding to each word in the text to be corrected and a coding vector corresponding to each candidate word to be corrected through the network model to obtain a target characteristic vector of each candidate word to be corrected;
obtaining a score corresponding to each candidate word to be corrected according to the target feature vector of each candidate word to be corrected, wherein the score is used for representing the reasonableness of the text after the word to be corrected in the text to be corrected is replaced by using each candidate word to be corrected;
and replacing the word to be corrected in the text to be corrected by using the candidate word to be corrected with the largest score to obtain the text after error correction.
3. The method according to claim 1 or 2, wherein before obtaining the text to be corrected, the method further comprises:
acquiring a first text sequence;
determining words to be replaced in the first text sequence, wherein the words to be replaced are partial words in the first text sequence;
replacing the word to be replaced to obtain at least one second text sequence corresponding to the first text sequence;
taking each second text sequence in the at least one second text sequence as a training sample, and training the network model to obtain a pre-training model;
and fine-tuning the pre-training model to obtain a network model corresponding to the medical field after fine tuning.
4. The method of claim 3, wherein the determining a word to be replaced in the first text sequence comprises:
generating a random sampling rate by a random function;
and sampling the first text sequence according to the random sampling rate to obtain words to be replaced in the first text sequence.
5. The method of claim 3, wherein the words to be replaced include a first word to be replaced and a second word to be replaced, and wherein replacing the words to be replaced to obtain at least one second text sequence corresponding to the first text sequence comprises:
randomly acquiring at least one first candidate word from a dictionary library;
obtaining at least one second candidate word corresponding to the second word to be replaced from the dictionary database, wherein each second candidate word in the at least one second candidate word is one of the following: a homophone, a near-sound word, a near-form (visually similar) word, or a word-order-reversed variant of the second word to be replaced;
and replacing a first word to be replaced by each first candidate word in the at least one first candidate word and replacing a second word to be replaced by each second candidate word to obtain at least one second text sequence corresponding to the first text sequence.
6. The method according to claim 5, wherein training the network model using each of the at least one second text sequence as a training sample to obtain a pre-training model comprises:
inputting each second text sequence into the network model to obtain a prediction result for each word in the second text sequence and a third text sequence obtained by correcting the second text sequence, wherein the prediction result of each word indicates whether the word is predicted to have been replaced;
obtaining a first loss according to the prediction result and the real result of each word, wherein the real result of each word is used for indicating whether each word which is labeled in advance is replaced or not;
obtaining a second loss according to the third text sequence and the first text sequence;
and adjusting network parameters of the network model according to the first loss and the second loss so as to train the network model to obtain a pre-training model.
7. The method of claim 6, wherein the fine-tuning the pre-trained model to obtain a fine-tuned network model corresponding to the medical domain comprises:
acquiring a fourth text sequence of the medical field;
determining entities in the fourth text sequence, and acquiring at least one candidate entity corresponding to the entities from a pre-constructed entity dictionary library corresponding to the medical field;
replacing the entities in the fourth text sequence by using each candidate entity in the at least one candidate entity to obtain at least one fifth text sequence;
and respectively using each fifth text sequence in the at least one fifth text sequence to finely adjust the pre-training model to obtain a finely adjusted network model corresponding to the medical field.
8. A text correction apparatus, comprising:
the acquiring unit is used for acquiring a text to be corrected in the medical field;
the processing unit is used for inputting the text to be corrected in the medical field into a fine-tuned network model corresponding to the medical field, and determining the word to be corrected in the text to be corrected;
and the processing unit is also used for correcting the word to be corrected through the network model to obtain a text after error correction.
9. A text correction apparatus comprising a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, the programs comprising instructions for performing the steps in the method of any of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which is executed by a processor to implement the method according to any one of claims 1-7.
CN202011030582.XA 2020-09-27 2020-09-27 Text error correction method, device and storage medium Active CN111931490B (en)

Priority Applications (1)

Application Number: CN202011030582.XA (granted as CN111931490B)
Priority date / Filing date: 2020-09-27
Title: Text error correction method, device and storage medium


Publications (2)

Publication Number / Publication Date
CN111931490A / 2020-11-13
CN111931490B / 2021-01-08

Family ID: 73334271


Country Status (1)

CN: CN111931490B

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0031493B1 (en) * 1979-12-28 1986-03-05 International Business Machines Corporation Alpha content match prescan method and system for automatic spelling error correction
CN103154936A (en) * 2010-09-24 2013-06-12 新加坡国立大学 Methods and systems for automated text correction
US20120166942A1 (en) * 2010-12-22 2012-06-28 Apple Inc. Using parts-of-speech tagging and named entity recognition for spelling correction
CN107305768A (en) * 2016-04-20 2017-10-31 上海交通大学 Easy wrongly written character calibration method in interactive voice
CN107122346A (en) * 2016-12-28 2017-09-01 平安科技(深圳)有限公司 The error correction method and device of a kind of read statement
CN107220235A (en) * 2017-05-23 2017-09-29 北京百度网讯科技有限公司 Speech recognition error correction method, device and storage medium based on artificial intelligence
CN110110041A (en) * 2019-03-15 2019-08-09 平安科技(深圳)有限公司 Wrong word correcting method, device, computer installation and storage medium
CN110348008A (en) * 2019-06-17 2019-10-18 五邑大学 Medical text based on pre-training model and fine tuning technology names entity recognition method
CN110705262A (en) * 2019-09-06 2020-01-17 宁波市科技园区明天医网科技有限公司 Improved intelligent error correction method applied to medical skill examination report
CN111310447A (en) * 2020-03-18 2020-06-19 科大讯飞股份有限公司 Grammar error correction method, grammar error correction device, electronic equipment and storage medium

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022116445A1 (en) * 2020-12-01 2022-06-09 平安科技(深圳)有限公司 Method and apparatus for establishing text error correction model, medium and electronic device
WO2022121178A1 (en) * 2020-12-11 2022-06-16 平安科技(深圳)有限公司 Training method and apparatus and recognition method and apparatus for text error correction model, and computer device
CN112597753A (en) * 2020-12-22 2021-04-02 北京百度网讯科技有限公司 Text error correction processing method and device, electronic equipment and storage medium
CN113641793A (en) * 2021-08-16 2021-11-12 国网安徽省电力有限公司电力科学研究院 Retrieval system for long text matching optimization aiming at power standard
CN114564942A (en) * 2021-09-06 2022-05-31 北京数美时代科技有限公司 Text error correction method, storage medium and device for supervision field
CN113887245A (en) * 2021-12-02 2022-01-04 腾讯科技(深圳)有限公司 Model training method and related device
CN113887245B (en) * 2021-12-02 2022-03-25 腾讯科技(深圳)有限公司 Model training method and related device

Also Published As

Publication number Publication date
CN111931490B (en) 2021-01-08

Similar Documents

Publication Publication Date Title
CN111931490B (en) Text error correction method, device and storage medium
CN108829822B (en) Media content recommendation method and device, storage medium and electronic device
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
CN109165291B (en) Text matching method and electronic equipment
CN108932342A (en) Semantic matching method, model learning method, and server
CN111241237B (en) Intelligent question-answer data processing method and device based on operation and maintenance service
CN111444320A (en) Text retrieval method and device, computer equipment and storage medium
CN111310440B (en) Text error correction method, device and system
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
CN114580382A (en) Text error correction method and device
CN113869044A (en) Keyword automatic extraction method, device, equipment and storage medium
CN114492363B (en) Small sample fine adjustment method, system and related device
US10915756B2 (en) Method and apparatus for determining (raw) video materials for news
CN110334186A (en) Data query method, apparatus, computer equipment and computer readable storage medium
CN111985228A (en) Text keyword extraction method and device, computer equipment and storage medium
CN111611791B (en) Text processing method and related device
CN114781651A (en) Small-sample learning robustness improvement method based on contrastive learning
CN110659392B (en) Retrieval method and device, and storage medium
CN112270184A (en) Natural language processing method, device and storage medium
CN113569011A (en) Training method, device and equipment of text matching model and storage medium
CN112016281B (en) Method and device for generating wrong medical text and storage medium
CN111460808A (en) Synonymous text recognition and content recommendation method and device and electronic equipment
CN108304366B (en) Hypernym detection method and device
CN107729509B (en) Discourse similarity determination method based on implicit high-dimensional distributed feature representation
CN113408287B (en) Entity identification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant