CN117743857A - Text correction model training, text correction method, device, equipment and medium


Info

Publication number
CN117743857A
Authority
CN
China
Prior art keywords
error correction
text
sample
category
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311864820.0A
Other languages
Chinese (zh)
Inventor
Zhang Yang (张阳)
Ma Dongsheng (马东升)
Jiang Hongyu (蒋红宇)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Haitai Fangyuan High Technology Co Ltd
Original Assignee
Beijing Haitai Fangyuan High Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Haitai Fangyuan High Technology Co Ltd filed Critical Beijing Haitai Fangyuan High Technology Co Ltd
Priority to CN202311864820.0A
Publication of CN117743857A
Legal status: Pending


Abstract

The embodiment of the application relates to the technical field of text error correction, and in particular to a text error correction model training method, a text error correction method and apparatus, computer equipment, and a storage medium. The text error correction model training method comprises the following steps: labeling each sample in a training sample set according to preset error correction categories to obtain a labeling result, where the preset error correction categories comprise at least a correct category and a plurality of error categories, and each sample contains at least one error category; for each input sample fed into the text error correction model during training, determining the corresponding output sample; and, if an input sample does not contain all of the error categories, reducing the loss proportion of the correct category when computing the loss function, until the text error correction model reaches the convergence target. Because error categories are identified and labeled on the samples before the text error correction model is trained, mis-correction (altering text that is not actually wrong) can be prevented, the mis-correction rate is reduced, and the accuracy, reliability and efficiency of error correction are improved.

Description

Text correction model training, text correction method, device, equipment and medium
Technical Field
The present application relates to the field of text error correction technologies, and in particular to a text error correction model training method, a text error correction method and apparatus, a computer device, and a storage medium.
Background
Text error types include extra characters, missing characters, wrong characters, out-of-order characters, and so on, and the corresponding correction operations are, respectively, delete, add, replace, and move, while correct characters remain unchanged. The typical text error correction approach at present is the sequence-to-edit-label method.
The sequence-to-edit-label method defines error correction labels such as K (keep, error-free), D (delete, extra character), A (add, missing character), R (replace, wrong character), and M (move, out-of-order) for the corresponding error types, converting text error correction into a sequence labeling problem and achieving a good correction effect. In practical applications, however, the proportions of the various error correction labels differ enormously; in particular, the error-free proportion is far too high. As a result, it is difficult to construct training data for the sequence-to-edit-label method, the model training effect is hard to improve, and the practical performance is poor.
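For illustration, a minimal sketch of how a sentence might map to edit labels under such a scheme follows; the sentence and labels here are hypothetical, not taken from the patent:

    # Hypothetical sequence-to-edit labeling: one edit label per token.
    # K = keep, D = delete (extra word), R = replace (wrong word).
    tokens = ["The", "the", "cat", "sta", "on", "the", "mat"]
    labels = ["K",   "D",   "K",   "R",   "K",  "K",   "K"]
    # "the" is a duplicated extra word (D); "sta" is a typo for "sat" (R).
    # Even in a sentence with two errors, the K label dominates heavily.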
Therefore, a text error correction method with a better correction effect is needed.
Disclosure of Invention
The embodiment of the application provides a text error correction model training method, a text error correction method and apparatus, computer equipment, and a storage medium.
In a first aspect of the embodiments of the present application, a text error correction model training method is provided, including:
labeling each sample in a training sample set according to preset error correction categories to obtain a labeling result; the preset error correction categories comprise at least a correct category and a plurality of error categories, and each sample contains at least one error category;
for each input sample fed into the text error correction model during training, determining the output sample corresponding to that input sample; the output sample is the prediction result of the text error correction model for the input sample in each training iteration;
and, for each output sample, if the corresponding input sample does not contain all of the error categories, reducing the loss proportion of the correct category when computing the loss function, until the text error correction model reaches the convergence target, whereupon the current text error correction model is determined as the target text error correction model.
In the first aspect, the embodiment of the application identifies and labels error categories on the samples before the text error correction model is trained and before actual error correction, so that mis-correction can be prevented, the mis-correction rate is reduced, and the accuracy, reliability and efficiency of error correction are improved;
in the second aspect, during training, when an input sample does not contain all of the error categories, the loss proportion of the correct category in the loss function calculation is reduced, which effectively improves the accuracy of the error correction model;
in the third aspect, by reducing the loss proportion of the correct category when an input sample does not contain all of the error categories, i.e. by improving the loss function Loss of the text error correction model, the influence of correct samples on the model is reduced and model accuracy is improved;
in the fourth aspect, the training sample set uses only error samples and no correct samples, which reduces the amount of data the model has to process and shortens the training time.
In an optional embodiment of the present application, reducing the loss proportion of the correct category when computing the loss function comprises:
calculating an initial loss function according to the category label set of the input sample and the proportion of each error category in the output sample;
and, for the initial loss function of each input sample, reducing the proportion coefficient of the correct category to obtain the target loss.
During training, if the loss function Loss of the current text error correction model is less than or equal to a preset threshold, training has converged, and the current text error correction model can be determined as the target text error correction model. Obtaining the target loss by reducing the proportion coefficient of the correct category is simple and fast, and improves model training efficiency.
In an optional embodiment of the present application, reducing the loss proportion of the correct category when computing the loss function comprises:
determining, in the input sample and the output sample, the category labels of several correct-category positions adjacent to the error categories, to obtain the characteristic correct-category labels;
and filtering the characteristic correct-category labels out of the input sample and the output sample.
By filtering the characteristic correct-category labels out of the input sample and the output sample, the embodiment of the application increases the loss proportion of the error-category labels and thereby further improves the accuracy of model training.
In an optional embodiment of the present application, determining, in the input sample and the output sample, the category labels of the correct categories adjacent to the error categories to obtain the characteristic correct-category labels comprises:
determining, in the input sample, the category labels of several correct-category positions adjacent to the error categories, to obtain a first feature set;
determining, in the output sample, the category labels of several correct-category positions adjacent to the error categories, to obtain a second feature set;
and determining the intersection of the first feature set and the second feature set as the characteristic correct-category labels.
In each training iteration, the embodiment of the application determines the first feature set adjacent to the error categories in the input sample and the second feature set adjacent to the error categories in the output sample, and takes their intersection as the characteristic correct-category labels. This makes the training data structure more reasonable, effectively shortens training time while preserving the training effect, and improves training efficiency.
In an optional embodiment of the present application, after labeling each sample in the training sample set according to the preset error correction categories, the method further comprises:
counting the number of category labels in each preset error correction category;
and determining the loss proportion of each preset error correction category according to the number of category labels in that category and the total number of category labels in the training sample set.
By first counting the number of category labels in each preset error correction category and then determining each category's loss proportion from that count and the total number of category labels in the training sample set, the loss proportion of the sample labels in each error correction category can be determined intuitively and concretely, which facilitates subsequent proportional adjustment and makes the adjustment more accurate.
In an optional embodiment of the present application, before counting the number of category labels in each preset error correction category, the method further comprises:
performing word segmentation on each sample in the training sample set;
correspondingly, labeling each sample in the training sample set according to the preset error correction categories to obtain the labeling result comprises:
labeling each segmented word of each sample in the training sample set according to the preset error correction categories, to obtain the labeling result.
By first segmenting each sample in the training sample set and then labeling each segmented word according to the preset error correction categories, the embodiment of the application facilitates the subsequent statistics of the loss proportion of each error correction category, improves statistical efficiency, and thereby improves model training efficiency.
In an alternative embodiment of the present application, a segmented word produced by the word segmentation consists of one or more characters.
Segmenting the samples down to the character level makes the segmentation finer and the error recognition effect better, which can further improve the model training effect.
In an optional embodiment of the present application, the preset error correction categories include at least one of: extra character, missing character, and wrong character.
The embodiment of the application replaces the traditional out-of-order error label with the extra-character and missing-character categories, adjusting the correction of an out-of-order error to a deletion plus an addition; that is, the out-of-order label is eliminated. For example, the transposed pair 'BA' can be corrected by deleting the misplaced character and adding it back in the right position, rather than by a move operation. This reduces the computation required for error identification and improves model training efficiency.
In a second aspect of the embodiments of the present application, a text error correction method is provided, including:
judging whether text errors exist in the text to be corrected;
and, if a text error exists in the text to be corrected, correcting the text to be corrected through the target text error correction model described in any of the embodiments above.
The text error correction method provided by the embodiment of the application separates error recognition from text correction: the text is checked for error categories and labeled before correction, so mis-correction can be prevented, the mis-correction rate is reduced, and the accuracy, reliability and efficiency of error correction are further improved.
In a third aspect of the embodiments of the present application, there is provided a text error correction model training apparatus, including:
the labeling module is configured to label each sample in a training sample set according to preset error correction categories to obtain a labeling result; the preset error correction categories comprise at least a correct category and a plurality of error categories, and each sample contains at least one error category;
the first training module is configured to determine, for each input sample fed into the text error correction model during training, the output sample corresponding to that input sample; the output sample is the prediction result of the text error correction model for the input sample in each training iteration;
and the second training module is configured to, for each output sample, if the corresponding input sample does not contain all of the error categories, reduce the loss proportion of the correct category when computing the loss function, until the text error correction model reaches the convergence target, and determine the current text error correction model as the target text error correction model.
In an optional embodiment of the present application, the second training module is specifically configured to: calculate an initial loss function according to the category label set of the input sample and the proportion of each error category in the output sample; and, for the initial loss function of each input sample, reduce the proportion coefficient of the correct category to obtain the target loss.
In an optional embodiment of the present application, the second training module is specifically configured to: determine, in the input sample and the output sample, the category labels of several correct-category positions adjacent to the error categories, to obtain the characteristic correct-category labels; and filter the characteristic correct-category labels out of the input sample and the output sample.
In an optional embodiment of the present application, the second training module is specifically configured to: determine, in the input sample, the category labels of several correct-category positions adjacent to the error categories, to obtain a first feature set; determine, in the output sample, the category labels of several correct-category positions adjacent to the error categories, to obtain a second feature set; and determine the intersection of the first feature set and the second feature set as the characteristic correct-category labels.
In an alternative embodiment of the present application, the second training module is further configured to: count the number of category labels in each preset error correction category; and determine the loss proportion of each preset error correction category according to the number of category labels in that category and the total number of category labels in the training sample set.
In an alternative embodiment of the present application, the second training module is further configured to: perform word segmentation on each sample in the training sample set; and label each segmented word of each sample in the training sample set according to the preset error correction categories, to obtain the labeling result.
In an alternative embodiment of the present application, a segmented word produced by the word segmentation consists of one or more characters.
In an optional embodiment of the present application, the preset error correction categories include at least one of: extra character, missing character, and wrong character.
In a fourth aspect of the embodiments of the present application, there is provided a text error correction apparatus, including:
the recognition module is configured to judge whether a text error exists in the text to be corrected;
and the error correction module is configured to, if a text error exists in the text to be corrected, correct the text to be corrected through the target text error correction model described in any of the embodiments above.
In a fifth aspect of embodiments of the present application, there is provided a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of any of the methods described above when executing the computer program.
In a sixth aspect of the embodiments of the present application, there is provided a computer readable storage medium having a computer program stored thereon, the computer program, when executed by a processor, implementing the steps of any of the methods described above.
According to the text error correction model training method provided by the embodiment of the application, sample preprocessing is separated from error correction model training: each sample in the training sample set is labeled according to the preset error correction categories; then, after an initial training result is obtained during training, if the initial training result does not contain all of the error categories, the proportion of the correct category in the training result is reduced to obtain a corrected training result; and the loss function is calculated based on the comparison of the corrected training result with the labeling result, until the text error correction model reaches the convergence target, whereupon the current text error correction model is determined as the target text error correction model.
In the first aspect, the embodiment of the application identifies and labels error categories on the samples before the text error correction model is trained and before actual error correction, so that mis-correction can be prevented, the mis-correction rate is reduced, and the accuracy, reliability and efficiency of error correction are improved;
in the second aspect, the initial training result obtained in each training iteration is adjusted during training so that the proportion of correct-category samples in it is continuously reduced, which effectively improves the accuracy of the error correction model;
in the third aspect, the loss function is calculated based on the comparison of the adjusted initial training result with the labeling result, i.e. the influence of correct samples on the model is reduced by improving the loss function Loss of the text error correction model, improving model accuracy;
in the fourth aspect, the training sample set uses only error samples and no correct samples, which reduces the amount of data the model has to process and shortens the training time.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a flowchart of a text error correction model training method according to one embodiment of the present application;
FIG. 2 is a flowchart of a text error correction model training method according to one embodiment of the present application;
FIG. 3 is a flowchart of a text error correction model training method according to one embodiment of the present application;
FIG. 4 is a flowchart of a text error correction model training method according to one embodiment of the present application;
FIG. 5 is a flowchart of a text error correction model training method according to one embodiment of the present application;
FIG. 6 is a flow chart of a text error correction method provided in one embodiment of the present application;
fig. 7 is a schematic structural diagram of a text error correction model training device according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a text error correction device according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In the process of implementing the present application, the inventors found that a text error correction method with a better correction effect is needed.
To address this problem, the embodiment of the application provides a text error correction model training method for improving text error correction efficiency and correction effect.
The solutions in the embodiments of the present application may be implemented in various computer languages, for example the object-oriented programming languages Python and Java, or the interpreted scripting language JavaScript.
In order to make the technical solutions and advantages of the embodiments of the present application more apparent, exemplary embodiments of the present application are described in detail below with reference to the accompanying drawings. The described embodiments are evidently only some of the embodiments of the present application, not all of them. It should be noted that, where no conflict arises, the embodiments and the features in the embodiments may be combined with each other.
The text error correction model training method provided by the embodiment of the application can be applied to any electronic equipment with a computing function, such as a server, a computer, a notebook computer, or a mobile phone, without being exhaustive. The method is described in detail below with reference to specific examples. Referring to fig. 1, the text error correction model training method provided in the embodiment of the present application includes the following steps 101 to 103:
step 101, labeling each sample in a training sample set according to a preset error correction category to obtain a labeling result;
The preset error correction categories comprise at least a correct category and a plurality of error categories, and each sample contains at least one error category. The training sample set can be adjusted flexibly for different fields or scenes; for example, different training sample sets can be prepared for the news field, the medical field, and so on. The number and proportion of error categories among the samples in the training sample set can be set flexibly from empirical statistics, with the aim of improving the training effect; the embodiment of the application is not specifically limited in this respect. In this embodiment, each sample in the training sample set contains at least one error category; that is, every sample has at least one text error, and no sample is completely correct. In other words, a sample may contain one error category or several. The correct category means the text has no error, and the error categories include, for example, extra characters, missing characters, wrong characters and so on, which are not specifically limited and can be adjusted flexibly according to the actual situation. It should further be explained that a sample may be a sentence or a phrase, and labeling means assigning a label to each character or word in it. For example, in the Chinese example sentence used in this description, which contains three wrong words (appearing in translation as 'formula', 'respect' and 'relationship'), each wrong word is labeled with the wrong-word error category, while every other character or word is labeled with the correct category. The labeling result is a sample set (or any other form) containing the error-category and correct-category label identifications.
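As a concrete illustration, a minimal sketch of character-level labeling is shown below; the function name and label scheme are hypothetical, not part of the patent:

    # Hypothetical sketch of step 101: assign an error-correction label to
    # every character of a sample. "k" marks a correct character; the values
    # in error_positions mark wrong-word errors (r1, r2, r3 in the example).
    def label_sample(chars, error_positions):
        return [error_positions.get(i, "k") for i in range(len(chars))]

    # A 16-character sample whose 4th, 8th and 13th characters are wrong words:
    labels = label_sample(range(16), {3: "r1", 7: "r2", 12: "r3"})
    # -> ["k","k","k","r1","k","k","k","r2","k","k","k","k","r3","k","k","k"]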
Step 102, for each input sample fed into the text error correction model during training, determining the output sample corresponding to that input sample;
The input samples are samples extracted from the training set and fed into the text error correction model. An input sample may comprise two sets: the first is the text sample set Xi, here the example sentence above; the second is the error correction label set Yi = [k, k, k, r1, k, k, k, r2, k, k, k, k, r3, k, k, k], giving the error category of each character in the text sample set. The output sample is the prediction result of the text error correction model for the input sample in each training iteration, that is, the set of error correction categories the model assigns to the sample; this prediction result is the training result. It should be explained that when training reaches an ideal degree, the error correction model converges. The text error correction model can be obtained from Transformer-based pre-trained model series such as the BERT series or the GPT series, which are not exhaustively listed here.
And step 103, for each output sample, if the corresponding input sample does not contain all of the error categories, reducing the loss proportion of the correct category when computing the loss function, until the text error correction model reaches the convergence target, whereupon the current text error correction model is determined as the target text error correction model.
For example, suppose the error categories are error category 1, error category 2 and error category 3, and an input sample contains only error categories 1 and 2 and no error category 3; error category 3 is then a missing error category. The proportion of correct-category samples in the initial training result can be reduced according to a set ratio, either by decreasing the correct category or by increasing the error categories; the embodiment of the application is not specifically limited in this respect, and the choice can be made flexibly according to the actual situation. The set ratio can be adjusted according to the kinds and number of missing error categories: the more kinds and the larger the number of missing error categories, the higher the reduction of the correct category; conversely, the fewer the missing error categories, the smaller the reduction.
The text error correction model training method above separates sample preprocessing from error correction model training: each sample in the training sample set is labeled according to the preset error correction categories; then, during training, the output sample corresponding to each input sample fed into the text error correction model is determined; and, for each output sample, if the corresponding input sample does not contain all of the error categories, the loss proportion of the correct category is reduced when computing the loss function, until the text error correction model reaches the convergence target, whereupon the current text error correction model is determined as the target text error correction model.
In the first aspect, the embodiment of the application identifies and labels error categories on the samples before the text error correction model is trained and before actual error correction, so that mis-correction can be prevented, the mis-correction rate is reduced, and the accuracy, reliability and efficiency of error correction are improved;
in the second aspect, during training, when an input sample does not contain all of the error categories, the loss proportion of the correct category in the loss function calculation is reduced, which effectively improves the accuracy of the error correction model;
in the third aspect, by reducing the loss proportion of the correct category when an input sample does not contain all of the error categories, i.e. by improving the loss function Loss of the text error correction model, the influence of correct samples on the model is reduced and model accuracy is improved;
in the fourth aspect, the training sample set uses only error samples and no correct samples, which reduces the amount of data the model has to process and shortens the training time.
Referring to fig. 2, in an alternative embodiment of the present application, reducing the loss proportion of the correct category when computing the loss function in step 103 includes the following steps 201 to 202:
Step 201, calculating an initial loss function according to the category label set of the input sample and the proportion of each error category in the output sample;
step 202, for the initial loss function of each input sample, reducing the proportion coefficient of the correct category to obtain the target loss.
In the present embodiment, the expression of the loss function Loss can be exemplified as follows:

Loss = p_k*loss_k + p_d*loss_d + p_a*loss_a + p_r*loss_r    (1)

wherein each per-category term is a cross-entropy loss:

loss_k = -Σ y_k*ln(pk)
loss_d = -Σ y_d*ln(pd)
loss_a = -Σ y_a*ln(pa)
loss_r = -Σ y_r*ln(pr)

In formula (1), k denotes the correct (error-free) category, d the extra-character (delete) category, a the missing-character (add) category, and r the wrong-character (replace) category. Loss is the loss function during training; p_k is the proportion of correct-category samples, and p_d, p_a and p_r are the proportions of the respective error-category samples in the total. loss_k is the loss of the correct-category samples, and loss_d, loss_a and loss_r are the losses of the respective error-category samples; in each term, y is the ground-truth indicator of the corresponding label, and pk, pd, pa and pr are the model's predicted probabilities for that label.
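A minimal PyTorch-style sketch of formula (1) is given below; the function and tensor names are assumptions for illustration, not code from the patent:

    import torch
    import torch.nn.functional as F

    # Label indices: 0 = k (correct), 1 = d (extra), 2 = a (missing), 3 = r (wrong).
    def weighted_correction_loss(logits, targets, class_props):
        """Per-character cross-entropy, weighted by each class's sample proportion.

        logits:      (seq_len, 4) tensor of raw model scores per character
        targets:     (seq_len,)   tensor of ground-truth label indices
        class_props: (4,)         tensor of the proportions p_k, p_d, p_a, p_r
        """
        log_probs = F.log_softmax(logits, dim=-1)                 # ln(p) per label
        ce = -log_probs[torch.arange(len(targets)), targets]      # -y*ln(p) per char
        return (class_props[targets] * ce).sum()                  # p_*-weighted sum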
For example, suppose the error categories are error category 1 (label D), error category 2 (label A) and error category 3 (label R), and the correct category is denoted by label K. The following cases may occur:
Case 1: if a sample contains K, D, A and R labels, the Loss calculation is unchanged, and formula (1) is used directly;
Case 2: if a sample contains only K, D and A labels (the R label is missing), the proportion coefficient of the K label is reduced accordingly, taking p_k - p_r as the new correct-category proportion p_k, and the loss function can be calculated with the following formula (2):

Loss = (1 - p_r) × loss_k + (1 + p_r) × loss_a + (1 + p_r) × loss_d    (2)
By adjusting the weight of loss_k in this way, the loss function becomes more balanced.
Case 3: if a sample contains only K and A labels (the R and D labels are missing), the proportion of the K label is reduced accordingly, taking p_k - p_r - p_d as the new correct-category proportion coefficient p_k, and the loss function can be calculated with the following formula (3):

Loss = (1 - p_r - p_d) × loss_k + (1 + 2p_r + 2p_d) × loss_a    (3)
The other cases are similar: whenever any of the D, A or R error-category labels is missing, the K-label proportion is reduced accordingly and the Loss is recalculated; they are not enumerated here.
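A sketch of this case analysis is shown below. The redistribution rule is inferred from formulas (2) and (3) and should be read as an assumption, not the patent's exact prescription:

    # Hypothetical sketch: shrink the correct-class weight by the total
    # proportion of the error classes missing from a sample, and boost the
    # remaining error classes, following the pattern of formulas (2) and (3).
    ALL_ERROR_CLASSES = ("d", "a", "r")

    def adjust_class_weights(present_classes, props):
        """props is e.g. {"k": 0.9, "d": 0.04, "a": 0.04, "r": 0.02}."""
        missing = [c for c in ALL_ERROR_CLASSES if c not in present_classes]
        remaining = [c for c in ALL_ERROR_CLASSES if c in present_classes]
        missing_mass = sum(props[c] for c in missing)
        weights = {"k": 1.0 - missing_mass}         # e.g. (1 - p_r) in formula (2)
        if remaining:
            # Formula (2): each of the 2 remaining classes gets +p_r;
            # formula (3): the single remaining class gets +2(p_r + p_d).
            boost = 2.0 * missing_mass / len(remaining)
            for c in remaining:
                weights[c] = 1.0 + boost
        return weights

    # Case 2 (R missing): adjust_class_weights({"d", "a"}, props above)
    # -> {"k": 0.98, "d": 1.02, "a": 1.02}, matching (1 - p_r) and (1 + p_r).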
In the training process, if the loss function Loss of the current text error correction model is less than or equal to a preset threshold, training has converged, and the current text error correction model can be determined as the target text error correction model.
Obtaining the target loss by reducing the proportion coefficient of the correct category in this way is simple and fast, and improves model training efficiency.
Referring to fig. 3, in an alternative embodiment of the present application, reducing the loss proportion of the correct category when computing the loss function in step 103 includes the following steps 301 to 302:
Step 301, determining, in the input sample and the output sample, the category labels of several correct-category positions adjacent to the error categories, to obtain the characteristic correct-category labels;
For example, the text set in the input sample is Xi = [the 16-character example sentence above, containing three wrong words];
the label category set in the input sample is Yi = [k, k, k, r1, k, k, k, r2, k, k, k, k, r3, k, k, k];
the output sample set is y_hat = [k, a, k, rh, k, d, k, k, k, a, k, k, d, k, k, k],
where k denotes the category label of the correct category, r1, r2 and r3 denote the ground-truth wrong-word labels, and rh denotes a wrong-word label predicted by the model.
For example, taking the category labels of the 2 correct-category positions adjacent on each side of an error category:
for the first error label r1, the characteristic correct-category labels are [k, k, k, k]; for the second error label r2, the characteristic correct-category labels are [k, k, k, k]; and for the third error label r3, the characteristic correct-category labels are [k, k, k, k].
Step 302, filtering the characteristic correct-category labels out of the input sample and the output sample.
A new label category set of the input sample is then formed: yi_new = [k, k, -, r1, -, k, k, r2, k, k, -, -, r3, -, -, k];
and a new output sample set is formed: y_hat_new = [k, a, -, rh, -, d, k, k, k, a, -, -, d, -, -, k].
By filtering the characteristic correct-category labels out of the input sample and the output sample, the embodiment of the application increases the loss proportion of the error-category labels and thereby further improves the accuracy of model training.
Referring to fig. 4, in an alternative embodiment of the present application, determining in step 301 the category labels of the correct categories adjacent to the error categories in the input sample and the output sample, to obtain the characteristic correct-category labels, includes the following steps 401 to 403:
Step 401, determining, in the input sample, the category labels of several correct-category positions adjacent to the error categories, to obtain a first feature set;
step 402, determining, in the output sample, the category labels of several correct-category positions adjacent to the error categories, to obtain a second feature set;
step 403, determining the intersection of the first feature set and the second feature set as the characteristic correct-category labels.
As above, the text set in the input sample is Xi = [the 16-character example sentence containing three wrong words];
the label category set in the input sample is Yi = [k, k, k, r1, k, k, k, r2, k, k, k, k, r3, k, k, k].
Processing according to yi and y_hat: for the first error label r1, the 2 k-labeled characters on each side are taken from the training sample, giving A1 = [k, k, k, k]; the prediction at this position is rh, so the 2 characters on each side of it are taken from the output, giving B1 = [a, k, k, d]. When the loss function Loss is calculated, only positions labeled k in both the sample and the prediction are used; that is, the intersection of A1 and B1 is taken and those positions are masked.
For the second error label r2, the 2 k-labeled characters on each side are taken, giving A2 = [k, k, k, k]; the prediction y_hat at this position is k, a non-error label, so it is not processed and no adjustment is made in the Loss calculation.
For the third error label r3, the 2 k-labeled characters on each side are taken, giving A3 = [k, k, k, k]; the prediction y_hat at this position is d, so the 2 characters on each side are taken, giving B3 = [k, k, k, k]; the intersection of A3 and B3 is taken and those positions are masked.
A new label category set of the input sample is then formed: yi_new = [k, k, -, r1, -, k, k, r2, k, k, -, -, r3, -, -, k];
and a new output sample set is formed: y_hat_new = [k, a, -, rh, -, d, k, k, k, a, -, -, d, -, -, k].
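A hypothetical sketch of this masking step follows; the function name and the window size of 2 are assumptions drawn from the example above:

    # Hypothetical sketch of steps 401-403: mask correct-category ("k")
    # labels adjacent to an error position, but only at positions labeled k
    # in BOTH the ground truth and the prediction (the intersection).
    def mask_adjacent_correct(y_true, y_pred, window=2):
        masked = set()
        for i, label in enumerate(y_true):
            if label == "k":
                continue                  # only ground-truth errors anchor a window
            if y_pred[i] == "k":
                continue                  # prediction saw no error here (case r2)
            lo, hi = max(0, i - window), min(len(y_true), i + window + 1)
            for j in range(lo, hi):
                if j != i and y_true[j] == "k" and y_pred[j] == "k":
                    masked.add(j)         # position is k in both sets: mask it
        new_true = [("-" if j in masked else lab) for j, lab in enumerate(y_true)]
        new_pred = [("-" if j in masked else lab) for j, lab in enumerate(y_pred)]
        return new_true, new_pred

    # Applied to Yi and y_hat above, this reproduces yi_new and y_hat_new.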
By determining, in each training iteration, the first feature set adjacent to the error categories in the input sample and the second feature set adjacent to the error categories in the output sample, and taking their intersection as the characteristic correct-category labels, the embodiment of the application makes the training data structure more reasonable, effectively shortens training time while preserving the training effect, and improves training efficiency.
Referring to fig. 5, in an alternative embodiment of the present application, after labeling each sample in the training sample set according to the preset error correction categories in step 101, the method further includes the following steps 501 to 502:
Step 501, counting the number of category labels in each preset error correction category;
For example, one sample (a Chinese sentence containing three wrong words, shown here in rough translation) is segmented into the eleven segments listed in Table (1) below.

Table (1)

Sample segment:         XX | formula | is | subject | person | respect | subject | person | main | of | company
Error correction class: correct | wrong word | correct | correct | correct | wrong word | correct | correct | wrong word | correct | correct

In the above sample, the total number of category labels is 11: 8 labels of the correct category, 3 labels of the wrong-word category, and 0 labels each of the missing-character and extra-character categories.
Of course, these category labels are merely an example; a sample may also be segmented by single characters, by fixed-length chunks, and so on, and the example does not limit the sample labeling mode.
Step 502, determining the loss proportion of each preset error correction category according to the number of category labels in that category and the total number of category labels in the training sample set.
In the example above, the loss proportion of the correct-category labels in this sample is 8/11, the loss proportion of the wrong-word category labels is 3/11, and the loss proportions of the missing-character and extra-character category labels are both 0.
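A small sketch of steps 501 and 502 (names hypothetical) using the labels from Table (1):

    from collections import Counter

    # Hypothetical sketch: count category labels per preset error correction
    # category, then derive each category's loss proportion.
    labels = ["correct", "wrong", "correct", "correct", "correct", "wrong",
              "correct", "correct", "wrong", "correct", "correct"]  # Table (1)

    counts = Counter(labels)                     # {"correct": 8, "wrong": 3}
    total = sum(counts.values())                 # 11
    loss_props = {cat: n / total for cat, n in counts.items()}
    # -> {"correct": 8/11, "wrong": 3/11}; absent categories have proportion 0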
By first counting the number of category labels in each preset error correction category and then determining each category's loss proportion from that count and the total number of category labels in the training sample set, the loss proportion of the sample labels in each error correction category can be determined intuitively and concretely, which facilitates subsequent proportional adjustment and makes the adjustment more accurate.
In an optional embodiment of the present application, before counting the number of category labels in each preset error correction category in step 501, the text error correction model training method further includes the following step:
performing word segmentation on each sample in the training sample set.
As exemplified in the steps above, each sample may be segmented in the manner shown in Table (1); of course, segmentation may also be performed in other ways, such as by single characters or fixed-length chunks. The embodiment of the present application is not specifically limited in this respect, and the manner can be set according to the actual situation.
Correspondingly, labeling each sample in the training sample set according to the preset error correction categories to obtain a labeling result in step 101 includes the following step:
labeling each segmented word of each sample in the training sample set according to the preset error correction categories, to obtain the labeling result.
For example, in table (1) above, each word segment may be labeled according to a preset error correction category.
By first segmenting each sample in the training sample set and then labeling each segmented word according to the preset error correction categories, the embodiment of the application facilitates the subsequent statistics of the loss proportion of each error correction category, improves statistical efficiency, and thereby improves model training efficiency.
In an alternative embodiment of the present application, a segmented word produced by the word segmentation consists of one or more characters.
That is, the embodiment of the application can segment the sample down to the character level; the segmentation is finer, the error recognition effect is better, and the model training effect can be further improved.
In an optional embodiment of the present application, the preset error correction categories include at least one of: extra character, missing character, and wrong character.
The embodiment of the application replaces the traditional out-of-order error label with the extra-character and missing-character error categories, adjusting the correction of an out-of-order error to a deletion plus an addition; that is, the out-of-order label is eliminated, which reduces the computation required for error identification and improves model training efficiency.
Referring to fig. 6, one embodiment of the present application provides a text error correction method, including the following steps 601 to 602:
step 601, judging whether text errors exist in the text to be corrected;
The judgment can be made with a text recognition model obtained by training, for example, classification language models such as CNNs or RNNs, or Transformer-based models such as the BERT series or GPT models; this model only needs to recognize whether a sample contains an error. It should be explained that the text to be corrected is the target of the model in practical application, not a training sample used during the training described in the above embodiments.
And step 602, if a text error exists in the text to be corrected, correcting the text to be corrected through the target text error correction model described in any of the embodiments above.
The text error correction method provided by the embodiment of the application separates error recognition from text correction: the text is checked for error categories and labeled before correction, so mis-correction can be prevented, the mis-correction rate is reduced, and the accuracy, reliability and efficiency of error correction are further improved.
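Putting the two stages together, a minimal end-to-end sketch follows; the function names are assumptions, not the patent's API:

    # Hypothetical two-stage pipeline: detect first, correct only when needed.
    def correct_text(text, detector, corrector):
        """detector(text) returns True if the text contains an error (step 601);
        corrector is the trained target text error correction model (step 602)."""
        if not detector(text):
            return text            # no error detected: skip correction entirely
        return corrector(text)     # only erroneous text reaches the correction model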
It should be understood that, although the steps in the flowcharts are shown in the sequence indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the order of execution of the steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in the figures may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be executed at different times; their order of execution is not necessarily sequential, and they may be performed in turn or alternately with other steps or with sub-steps or stages of other steps.
Referring to fig. 7, an embodiment of the present application provides a text error correction model training apparatus 700, including a labeling module 710, a first training module 720 and a second training module 730, wherein:
the labeling module 710 is configured to label each sample in a training sample set according to preset error correction categories to obtain a labeling result; the preset error correction categories comprise at least a correct category and a plurality of error categories, and each sample contains at least one error category;
the first training module 720 is configured to determine, for each input sample fed into the text error correction model during training, the output sample corresponding to that input sample; the output sample is the prediction result of the text error correction model for the input sample in each training iteration;
the second training module 730 is configured to, for each output sample, if the corresponding input sample does not contain all of the error categories, reduce the loss proportion of the correct category when computing the loss function, until the text error correction model reaches the convergence target, and determine the current text error correction model as the target text error correction model.
In an optional embodiment of the present application, the second training module 730 is specifically configured to calculate an initial loss function according to the category label set of the input sample and the proportion of each error category in the output sample, and, for the initial loss function of each input sample, reduce the proportion coefficient of the correct category to obtain the target loss.
In an optional embodiment of the present application, the second training module 730 is specifically configured to determine, in the input sample and the output sample, the category labels of several correct-category positions adjacent to the error categories, to obtain the characteristic correct-category labels, and to filter the characteristic correct-category labels out of the input sample and the output sample.
In an optional embodiment of the present application, the labeling module 710 is specifically configured to determine, in the input sample, the category labels of several correct-category positions adjacent to the error categories, to obtain a first feature set; determine, in the output sample, the category labels of several correct-category positions adjacent to the error categories, to obtain a second feature set; and determine the intersection of the first feature set and the second feature set as the characteristic correct-category labels.
In an optional embodiment of the present application, the labeling module 710 is further configured to count the number of category labels in each preset error correction category, and the second training module 730 is further configured to determine the loss proportion of each preset error correction category according to the number of category labels in that category and the total number of category labels in the training sample set.
In an optional embodiment of the present application, the labeling module 710 is further configured to perform word segmentation on each sample in the training sample set, and to label each segmented word of each sample in the training sample set according to the preset error correction categories, to obtain the labeling result.
In an alternative embodiment of the present application, a segmented word produced by the word segmentation consists of one or more characters.
In an optional embodiment of the present application, the preset error correction categories include at least one of: extra character, missing character, and wrong character.
Referring to fig. 8, an embodiment of the present application provides a text error correction apparatus 800, including an identification module 810 and an error correction module 820, wherein:
the identification module 810 is configured to judge whether a text error exists in the text to be corrected;
the error correction module 820 is configured to, if a text error exists in the text to be corrected, correct the text to be corrected through the target text error correction model described in any of the embodiments above.
For the specific limitations of the text correction model training apparatus 700 and the text correction apparatus 800, reference may be made to the above limitations of the text correction model training method and the text correction method, and the details are not repeated here. The respective modules in the above-described text correction model training apparatus 700 and text correction apparatus 800 may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, whose internal structure may be as shown in fig. 9. The computer device includes a processor, a memory, a network interface and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system, a computer program and a database, and the internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing data, and the network interface is used for communicating with external terminals through a network connection. In short, the computer device comprises a memory storing a computer program and a processor which, when executing the computer program, realizes any step of the text error correction model training method and the text error correction method described above.
In one embodiment, a computer readable storage medium is provided, on which a computer program is stored; the computer program, when executed by a processor, can implement any of the steps of the text error correction model training method and the text error correction method described above.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (10)

1. A text error correction model training method, comprising:
labeling each sample in a training sample set according to preset error correction categories to obtain a labeling result; the preset error correction categories comprise at least a correct category and a plurality of error categories, and each sample contains at least one error category;
for each input sample fed into the text error correction model during training, determining the output sample corresponding to that input sample; the output sample is the prediction result of the text error correction model for the input sample in each training iteration;
and, for each output sample, if the corresponding input sample does not contain all of the error categories, reducing the loss proportion of the correct category when computing the loss function, until the text error correction model reaches the convergence target, whereupon the current text error correction model is determined as the target text error correction model.
2. The text error correction model training method of claim 1, wherein reducing the loss proportion of the correct category when computing the loss function comprises:
calculating an initial loss function according to the category label set of the input sample and the proportion of each error category in the output sample;
and, for the initial loss function of each input sample, reducing the proportion coefficient of the correct category to obtain the target loss.
3. The text error correction model training method of claim 1, wherein reducing the loss proportion of the correct category in the calculation of the loss function comprises:
determining the category labels of a plurality of correct categories adjacent to an error category in the input sample and the output sample, to obtain feature correct category labels;
and filtering the feature correct category labels out of the input sample and the output sample.
4. The text error correction model training method of claim 3, wherein determining the category labels of a plurality of correct categories adjacent to an error category in the input sample and the output sample comprises:
determining the category labels of a plurality of correct categories adjacent to the error category in the input sample, to obtain a first feature set;
determining the category labels of a plurality of correct categories adjacent to the error category in the output sample, to obtain a second feature set;
and determining the intersection of the first feature set and the second feature set as the feature correct category labels.
5. The text error correction model training method of claim 1, wherein, after labeling each sample in the training sample set according to the preset error correction categories, the method further comprises:
calculating, for each preset error correction category, the number of category labels in that category;
and determining the loss proportion of each preset error correction category according to the number of category labels in that category and the total number of category labels in the training sample set.
6. A method for text correction, comprising:
determining whether a text error exists in the text to be corrected;
and, if a text error exists in the text to be corrected, correcting the text to be corrected using the target text error correction model obtained by the training method according to any one of claims 1 to 5.
7. A text error correction model training apparatus, comprising:
a labeling module, configured to label each sample in a training sample set according to preset error correction categories to obtain a labeling result, wherein the preset error correction categories at least comprise a correct category and a plurality of error categories, and one sample contains at least one error category;
a first training module, configured to determine, for each input sample input into the text error correction model during training, an output sample corresponding to the input sample, wherein the output sample refers to the prediction result of the text error correction model for the input sample at each training step;
and a second training module, configured to, for each output sample, if the corresponding input sample does not contain all of the error categories, reduce the loss proportion of the correct category in the calculation of the loss function until the text error correction model reaches a convergence target, and determine the current text error correction model as a target text error correction model.
8. A text error correction apparatus, comprising:
a recognition module, configured to determine whether a text error exists in the text to be corrected;
and an error correction module, configured to correct, if a text error exists in the text to be corrected, the text to be corrected using the target text error correction model obtained by the training method according to any one of claims 1 to 5.
9. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method according to any one of claims 1 to 6 when executing the computer program.
10. A computer readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
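
The claimed steps can be made concrete with a few hedged sketches, given here after the claims. The first covers the down-weighting in claims 1 and 2; it is an illustration only, not the patented implementation. The category indices, the `down_weight` value, and the use of PyTorch's `CrossEntropyLoss` class weights are all assumptions made for the example.

```python
# Illustrative sketch of down-weighting the correct category in the loss
# (claims 1-2). CORRECT_IDX, NUM_CLASSES and down_weight are assumptions,
# not values taken from the patent.
import torch
import torch.nn as nn

CORRECT_IDX = 0   # hypothetical index of the "correct" (keep) category
NUM_CLASSES = 5   # correct category + four error categories (delete/add/replace/move)

def loss_for_sample(logits: torch.Tensor, labels: torch.Tensor,
                    down_weight: float = 0.3) -> torch.Tensor:
    """logits: (seq_len, NUM_CLASSES); labels: (seq_len,) of category indices."""
    weights = torch.ones(NUM_CLASSES)
    if set(labels.tolist()) != set(range(NUM_CLASSES)):  # sample lacks some error category
        weights[CORRECT_IDX] = down_weight               # shrink the correct-category share
    return nn.CrossEntropyLoss(weight=weights)(logits, labels)
```

Because the correct ("keep") category dominates real corpora, shrinking its coefficient keeps the gradient from being swamped by trivially correct tokens, which is consistent with the stated aim of reducing the miscorrection rate.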
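Claims 3 and 4 pick out correct-category labels that sit next to an error and filter them from both samples. Below is a plain-Python sketch assuming a "KEEP" tag for the correct category and a window of one position for "adjacent"; neither convention is fixed by the patent.

```python
# Sketch of the adjacency-and-intersection step of claims 3-4. The "KEEP"
# tag and the adjacency window are illustrative assumptions.
def adjacent_correct(labels: list[str], window: int = 1) -> set[int]:
    """Indices of correct-category labels within `window` positions of an error label."""
    error_positions = {i for i, tag in enumerate(labels) if tag != "KEEP"}
    return {i for i, tag in enumerate(labels)
            if tag == "KEEP" and any(abs(i - j) <= window for j in error_positions)}

def feature_correct_labels(input_labels: list[str], output_labels: list[str]) -> set[int]:
    first_set = adjacent_correct(input_labels)    # first feature set (input sample)
    second_set = adjacent_correct(output_labels)  # second feature set (output sample)
    return first_set & second_set                 # intersection = feature correct category labels
```

One reading of the filtering in claim 3 is that it removes exactly the correct labels most easily confused with their erroneous neighbours.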
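Claim 5 derives a loss proportion for each category from label counts. The claim does not fix the formula, so the inverse-frequency scheme below is only one plausible reading.

```python
# Sketch of claim 5: per-category loss proportions from label counts.
# The inverse-frequency formula is an assumption; the claim only says the
# proportion depends on per-category counts and the total label count.
from collections import Counter

def loss_proportions(all_labels: list[str]) -> dict[str, float]:
    counts = Counter(all_labels)
    total = sum(counts.values())
    # Rarer categories receive a larger share of the loss, common ones a smaller share.
    return {category: 1.0 - count / total for category, count in counts.items()}
```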
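The two-step use of the trained model in claim 6 is simple enough to show end to end; `has_text_error` and `target_model` below are hypothetical stand-ins, since the claim does not specify how error detection is performed.

```python
# Sketch of the claim-6 flow: detect first, correct only when needed.
# `has_text_error` and `target_model.correct` are hypothetical interfaces.
def correct_text(text: str, has_text_error, target_model) -> str:
    if not has_text_error(text):       # step 1: judge whether a text error exists
        return text                    # no error: return the text unchanged
    return target_model.correct(text)  # step 2: correct with the target model
```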
Application CN202311864820.0A (filed 2023-12-29, priority date 2023-12-29): Text correction model training, text correction method, device, equipment and medium. Published as CN117743857A (status: pending).

Priority Applications (1)

Application Number: CN202311864820.0A
Priority Date: 2023-12-29
Filing Date: 2023-12-29
Title: Text correction model training, text correction method, device, equipment and medium


Publications (1)

Publication Number: CN117743857A
Publication Date: 2024-03-22

Family

ID=90283257

Family Applications (1)

Application Number: CN202311864820.0A
Title: Text correction model training, text correction method, device, equipment and medium
Priority Date: 2023-12-29
Filing Date: 2023-12-29
Status: Pending

Country Status (1)

Country: CN
Publication: CN117743857A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111859919A (en) * 2019-12-02 2020-10-30 北京嘀嘀无限科技发展有限公司 Text error correction model training method and device, electronic equipment and storage medium
CN113343674A (en) * 2021-07-09 2021-09-03 北京海泰方圆科技股份有限公司 Method, device, equipment and medium for generating text error correction model training corpus
CN113743101A (en) * 2021-08-17 2021-12-03 北京百度网讯科技有限公司 Text error correction method and device, electronic equipment and computer storage medium
CN114332539A (en) * 2021-12-31 2022-04-12 深圳友一生物科技有限公司 Network training method for class unbalanced data set
US20220198137A1 (en) * 2020-12-23 2022-06-23 Beijing Baidu Netcom Science And Technology Co., Ltd. Text error-correcting method, apparatus, electronic device and readable storage medium
WO2022227207A1 (en) * 2021-04-30 2022-11-03 平安科技(深圳)有限公司 Text classification method, apparatus, computer device, and storage medium
CN115294581A (en) * 2022-08-01 2022-11-04 深圳市星桐科技有限公司 Method and device for identifying error characters, electronic equipment and storage medium
CN116136957A (en) * 2023-04-18 2023-05-19 之江实验室 Text error correction method, device and medium based on intention consistency
CN116341523A (en) * 2023-02-16 2023-06-27 平安科技(深圳)有限公司 Text error correction method, device, computer equipment and storage medium
CN117094311A (en) * 2023-10-19 2023-11-21 山东齐鲁壹点传媒有限公司 Method for establishing error correction filter for Chinese grammar error correction


Similar Documents

Publication Publication Date Title
US11392838B2 (en) Method, equipment, computing device and computer-readable storage medium for knowledge extraction based on TextCNN
CN110083639B (en) Intelligent data blood source tracing method and device based on cluster analysis
CN108090043B (en) Error correction report processing method and device based on artificial intelligence and readable medium
CN110580308B (en) Information auditing method and device, electronic equipment and storage medium
CN108846660B (en) Method and system for identifying abnormal fund
CN107341143A (en) A kind of sentence continuity determination methods and device and electronic equipment
CN111125658A (en) Method, device, server and storage medium for identifying fraudulent users
CN112148852A (en) Intelligent customer service method and device, storage medium and computer equipment
CN113138906A (en) Call chain data acquisition method, device, equipment and storage medium
CN117743857A (en) Text correction model training, text correction method, device, equipment and medium
CN110275880B (en) Data analysis method, device, server and readable storage medium
CN112183072A (en) Text error correction method and device, electronic equipment and readable storage medium
CN107133090A (en) A kind of method for processing business and device
CN111950237A (en) Sentence rewriting method, sentence rewriting device and electronic equipment
CN110806973A (en) Automatic generation method and device of interface message
CN110598466A (en) Offline field checking method, device and equipment and computer readable storage medium
CN115345600A (en) RPA flow generation method and device
CN115756395A (en) Multi-dimensional object data statistical method based on annotation configuration
CN111427328B (en) Method for reducing household system faults
CN112632132A (en) Method, device and equipment for processing abnormal import data
CN112733517A (en) Method for checking requirement template conformity, electronic equipment and storage medium
CN112486479A (en) Data acquisition method and device
CN112214669A (en) Home decoration material formaldehyde release data processing method and device and monitoring server
CN111352751A (en) Data file generation method and device, computer equipment and storage medium
CN111143643A (en) Element identification method and device, readable storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination