CN114861637A - Method and device for generating spelling error correction model and method and device for spelling error correction - Google Patents

Method and device for generating spelling error correction model and method and device for spelling error correction

Info

Publication number
CN114861637A
CN114861637A
Authority
CN
China
Prior art keywords
error correction
sample
text
model
sample set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210546618.2A
Other languages
Chinese (zh)
Other versions
CN114861637B (en)
Inventor
马芸
桂睿
曹宇慧
黄硕
陈永锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210546618.2A
Publication of CN114861637A
Application granted
Publication of CN114861637B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F18/00 Pattern recognition
            • G06F18/20 Analysing
              • G06F18/22 Matching criteria, e.g. proximity measures
          • G06F40/00 Handling natural language data
            • G06F40/20 Natural language analysis
              • G06F40/205 Parsing
                • G06F40/216 Parsing using statistical methods
              • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
            • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The disclosure provides a method and device for generating a spelling error correction model. It relates to the field of artificial intelligence, in particular to deep learning and natural language processing, and can be applied to scenarios such as OCR (optical character recognition). The specific implementation scheme is as follows: obtain an error correction sample set comprising at least one error correction sample; perform spelling error correction training on a pre-trained text recognition model based on the error correction sample set to obtain a to-be-tuned error correction model; select low-frequency samples comprising low-frequency words from the error correction sample set to obtain a low-frequency sample set; and perform spelling error correction training on the to-be-tuned error correction model based on the low-frequency sample set to obtain the spelling error correction model. This embodiment improves the generalization capability of the spelling error correction model with respect to spelling errors.

Description

Method and device for generating spelling error correction model and method and device for spelling error correction
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, in particular to deep learning and natural language processing, applicable to scenarios such as OCR, and specifically to a method and apparatus for generating a spelling error correction model, a method and apparatus for spelling error correction, an electronic device, a computer-readable medium, and a computer program product.
Background
A spelling error correction system aims to automatically recognize misspelled words in text and give corresponding modification suggestions based on natural language processing technology. Traditional spelling error correction systems mostly follow a technical route combining rule matching with a ranking model: candidates are recalled by rule matching based on dictionary resources and edit distances, and the recalled candidates are scored by a ranking model after feature extraction to form the error correction result. This traditional combination of rule matching and a ranking model relies excessively on dictionary resources and feature engineering, incurs high labor cost, and lacks generalization capability.
Disclosure of Invention
A method and apparatus for generating a spelling error correction model, a method and apparatus for spelling error correction, an electronic device, a computer-readable medium, and a computer program product are provided.
According to a first aspect, there is provided a method of generating a spelling error correction model, the method comprising: obtaining an error correction sample set comprising at least one error correction sample; performing spelling error correction training on a pre-trained text recognition model based on the error correction sample set to obtain a to-be-tuned error correction model; selecting low-frequency samples comprising low-frequency words from the error correction sample set to obtain a low-frequency sample set; and performing spelling error correction training on the to-be-tuned error correction model based on the low-frequency sample set to obtain the spelling error correction model.
According to a second aspect, there is provided a method of spelling error correction, the method comprising: acquiring text data to be corrected; and inputting the text data to be corrected into a spelling error correction model generated by the method described in any implementation of the first aspect, obtaining the error targets in the text data to be corrected and the correction results for those error targets.
According to a third aspect, there is provided an apparatus for generating a spelling error correction model, comprising: an error correction acquisition unit configured to acquire an error correction sample set including at least one error correction sample; a to-be-tuned training unit configured to perform spelling error correction training on a pre-trained text recognition model based on the error correction sample set to obtain a to-be-tuned error correction model; a low-frequency obtaining unit configured to select low-frequency samples including low-frequency words from the error correction sample set to obtain a low-frequency sample set; and a spelling training unit configured to perform spelling error correction training on the to-be-tuned error correction model based on the low-frequency sample set to obtain the spelling error correction model.
According to a fourth aspect, there is provided a spelling error correction apparatus comprising: a text acquisition unit configured to acquire text data to be corrected; and a result obtaining unit configured to input the text data to be corrected into a spelling error correction model generated by the apparatus described in any implementation of the third aspect, obtaining the error targets in the text data to be corrected and the correction results for those error targets.
According to a fifth aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method as described in any one of the implementations of the first aspect or the second aspect.
According to a sixth aspect, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method as described in any one of the implementations of the first or second aspect.
According to a seventh aspect, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as described in any of the implementations of the first or second aspect.
First, an error correction sample set comprising at least one error correction sample is acquired; second, spelling error correction training is performed on a pre-trained text recognition model based on the error correction sample set to obtain a to-be-tuned error correction model; third, low-frequency samples comprising low-frequency words are selected from the error correction sample set to obtain a low-frequency sample set; finally, spelling error correction training is performed on the to-be-tuned error correction model based on the low-frequency sample set to obtain the spelling error correction model. Fine-tuning the to-be-tuned error correction model with the low-frequency words in the error correction sample set thus improves the spelling error correction model's understanding of low-frequency words, reduces mis-corrections, and improves both the generalization of the spelling error correction model and its performance on the spelling error correction task.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow diagram of one embodiment of a method of generating a spell correction model according to the present disclosure;
FIG. 2 is a block diagram of the structure used to generate the spell correction model in an embodiment of the disclosure;
FIG. 3 is a flow diagram of one embodiment of a spell correction method according to the present disclosure;
FIG. 4 is a schematic structural diagram of an embodiment of a spell correction model generation apparatus according to the present disclosure;
FIG. 5 is a schematic diagram of a structure of an embodiment of a spell correction apparatus according to the present disclosure;
FIG. 6 is a block diagram of an electronic device for implementing a method of generating a spell correction model and a method of spell correction according to an embodiment of the disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present embodiment, "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature.
Fig. 1 illustrates a flow 100 of one embodiment of a method for generating a spell correction model according to the present disclosure, the method comprising the steps of:
Step 101, an error correction sample set comprising at least one error correction sample is obtained.
In this embodiment, the error correction sample set is a text data set obtained by the execution body of the spell correction model generation method in order to train the spell correction model. The execution body may obtain the error correction sample set in a variety of ways. For example, it may obtain an error correction sample set stored in a database server through a wired or wireless connection. As another example, it may receive an error correction sample set collected in real time by a terminal or another device.
In this embodiment, the error correction sample set includes at least one error correction sample. Each error correction sample may be a piece of text data in which some wrongly written characters are labeled with corresponding character labels, a character label being the correct character for the wrongly written character; optionally, some mispronounced words in the piece of text data are labeled with corresponding word labels, a word label being the correct word for the mispronounced word.
Optionally, the error correction sample set comprises at least one piece of text data, each piece of text data comprising: original text and a text label corresponding to the original text, the text label being the correct text corresponding to the original text.
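As a concrete illustration, such a sample can be represented as a simple pair of original text and label text. A minimal sketch follows; the field names ("source", "target") and the example sentences are hypothetical, not taken from the patent.

```python
# A minimal sketch of the error correction sample structure described above.
# Field names and example texts are illustrative assumptions.
error_correction_sample = {
    "source": "I recieved your mesage yesterday",   # original text containing spelling errors
    "target": "I received your message yesterday",  # text label: the correct text
}

# An error correction sample set is then simply a collection of such pairs.
error_correction_sample_set = [error_correction_sample]
```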
Step 102, performing spelling error correction training on the pre-trained text recognition model based on the error correction sample set to obtain a to-be-tuned error correction model.
In this embodiment, the pre-trained text recognition model is a model obtained by masked language model training, and the text recognition model is used for predicting content in a text.
The pre-trained text recognition model is trained as follows. On large-scale unlabeled text data, a portion of the characters in the text data are randomly replaced with special characters (the text recognition model treats these special characters as masks over the original characters). The replaced text data and the original data are input into the text recognition network corresponding to the text recognition model, and the network's prediction for the replaced text data is obtained through its encoding. The parameters of the text recognition network are adjusted based on the prediction and the original data until the number of training iterations reaches a training threshold or the loss value of the network reaches a loss threshold, yielding the text recognition model. Given any text with replacements, the resulting model predicts the original characters at the special-character positions.
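A minimal sketch of the masking step described above, assuming character-level input and a generic mask token; the names and the mask ratio are illustrative assumptions, not the patent's actual implementation.

```python
import random

MASK = "[MASK]"  # hypothetical special character that the model recognizes as a mask

def mask_text(chars: list[str], mask_ratio: float = 0.15) -> tuple[list[str], list[int]]:
    """Randomly replace a fraction of characters with the mask token.

    Returns the masked sequence and the indices of the masked positions; the
    original characters at those positions serve as the prediction targets.
    """
    masked = list(chars)
    positions = random.sample(range(len(chars)), max(1, int(len(chars) * mask_ratio)))
    for i in positions:
        masked[i] = MASK
    return masked, positions

# Usage: the text recognition network is trained to predict the original
# character at each masked position given the masked sequence.
original = list("spelling correction")
masked, positions = mask_text(original)
targets = [original[i] for i in positions]
```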
In this embodiment, the network structure of the text recognition model may adopt ERNIE (Enhanced Representation through Knowledge Integration) or other bidirectional models based on the Transformer structure, such as BERT (Bidirectional Encoder Representations from Transformers), ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately), and the like.
In this embodiment, the pre-trained text recognition model has no error correction capability: after text data shielded by masks is input, the model can only predict the characters at the masked positions of the input text data.
In this embodiment, the text recognition model undergoes spelling error correction training on the error correction sample set, and the resulting to-be-tuned error correction model has a certain error correction capability; that capability, however, is not yet mature and remains weak.
Step 103, selecting low-frequency samples comprising low-frequency words from the error correction sample set to obtain a low-frequency sample set.
In this embodiment, a low-frequency sample is an error correction sample containing vocabulary that appears rarely across the error correction sample set: the low-frequency words in a low-frequency sample occur only a small number of times in all the error correction samples of the set.
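One way to realize this selection is sketched below, assuming "low frequency" means a simple count threshold over whitespace-separated words across all samples; the threshold value and tokenization are illustrative assumptions (a character-level or subword criterion would work analogously for Chinese text).

```python
from collections import Counter

def select_low_frequency_samples(samples: list[str], threshold: int = 5) -> list[str]:
    """Keep samples containing at least one word whose corpus count is below threshold."""
    counts = Counter(word for text in samples for word in text.split())
    low_freq_words = {w for w, c in counts.items() if c < threshold}
    return [text for text in samples if any(w in low_freq_words for w in text.split())]
```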
Step 104, performing spelling error correction training on the to-be-tuned error correction model based on the low-frequency sample set to obtain the spelling error correction model.
In this embodiment, the training of the to-be-tuned error correction model comprises the following steps. Step one, select low-frequency samples from the low-frequency sample set. Step two, input the selected low-frequency samples into the to-be-tuned error correction model, which encodes them and predicts the true text at each text position of the selected samples. Step three, compute the loss value of the to-be-tuned error correction model based on the text it predicted and the selected low-frequency samples. Step four, if the to-be-tuned error correction model does not meet the training completion condition, adjust its parameters and continue executing steps one to four until it does, then take the to-be-tuned error correction model as the spelling error correction model. In this embodiment, the training completion condition is met when the loss value of the to-be-tuned error correction model reaches a loss threshold or the number of training iterations reaches a preset number, where one training iteration is one execution of steps one to four.
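The four steps map onto a standard supervised fine-tuning loop. Below is a minimal PyTorch-style sketch, assuming the to-be-tuned error correction model outputs a per-position character distribution trained with cross-entropy against the label text; the function names, data layout, and stopping values are assumptions, not the patent's implementation.

```python
import torch
import torch.nn.functional as F

def finetune_on_low_frequency(model, low_freq_loader, optimizer,
                              max_iterations=10_000, loss_threshold=0.05):
    """Steps one to four: sample, predict, compute loss, adjust until a completion condition holds."""
    for iteration, (input_ids, label_ids) in enumerate(low_freq_loader):
        logits = model(input_ids)                     # step two: encode and predict per position
        loss = F.cross_entropy(                       # step three: loss vs. the true text
            logits.view(-1, logits.size(-1)), label_ids.view(-1))
        optimizer.zero_grad()
        loss.backward()                               # step four: adjust parameters
        optimizer.step()
        # training completion condition: loss threshold or iteration budget reached
        if loss.item() < loss_threshold or iteration + 1 >= max_iterations:
            break
    return model  # the tuned model now serves as the spelling error correction model
```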
In this embodiment, low-frequency samples containing low-frequency words are selected to continue training the to-be-tuned error correction model, so that the trained spelling error correction model better understands the semantics of low-frequency words, reducing mis-corrections by the spelling error correction model.
Optionally, the method for generating a spelling error correction model may further include: selecting error correction samples prone to mis-correction from the error correction sample set to obtain a mis-correction-prone sample set; and training the spelling error correction model on the mis-correction-prone sample set to obtain the final error correction model. In this embodiment, the samples in the mis-correction-prone sample set are of types on which mis-correction is likely to occur, such as error correction samples containing proper names (e.g., person names, place names, etc.).
First, an error correction sample set comprising at least one error correction sample is obtained; second, spelling error correction training is performed on a pre-trained text recognition model based on the error correction sample set to obtain a to-be-tuned error correction model; third, low-frequency samples comprising low-frequency words are selected from the error correction sample set to obtain a low-frequency sample set; finally, spelling error correction training is performed on the to-be-tuned error correction model based on the low-frequency sample set to obtain the spelling error correction model. Fine-tuning the to-be-tuned error correction model with the low-frequency words in the error correction sample set thus improves the spelling error correction model's understanding of low-frequency words, reduces mis-corrections, and improves both the generalization of the spelling error correction model and its performance on the spelling error correction task.
In some embodiments of the present disclosure, the method for generating a spelling error correction model further includes: during training of the to-be-tuned error correction model, performing contrastive learning on the semantic representations of the first target positions of the low-frequency samples in the low-frequency sample set to obtain a first contrastive learning loss; and adjusting the parameters of the to-be-tuned error correction model based on the first contrastive learning loss.
In this embodiment, during training of the to-be-tuned error correction model, after the model encodes a currently input low-frequency sample, in addition to predicting the true text at each position of the sample, a target position (the position of some character) is randomly selected and a contrastive learning objective is added at that position: the semantic representation of the position inside the to-be-tuned error correction model should be close to a preset positive sample and far from a preset negative sample.
In this embodiment, the semantic representation may be taken from the output of the last layer of the to-be-tuned error correction model; for example, when the to-be-tuned error correction model adopts an ERNIE encoder, the semantic representation is the output of the last layer of the ERNIE encoder.
In this embodiment, the low-frequency samples in the low-frequency sample set are those selected in the current training iteration of the to-be-tuned error correction model. The first target position of a low-frequency sample is the position of a piece of text (for example, a correct text position where a character or word is correct, or a text error position where a character or word is wrong), and the semantic representation of the first target position is the vector representation of that text in the last layer of the to-be-tuned error correction model. The semantic representation of the first target position is compared with a pre-constructed positive sample and a pre-constructed negative sample to determine the first contrastive learning loss; during the comparison, the semantic representation of the first target position is pushed as close as possible to the positive sample and as far as possible from the negative sample, and when the semantic representation is optimal, the prediction result of the to-be-tuned error correction model is determined to be optimal.
The spelling error correction model generation method of this embodiment introduces a contrastive learning mechanism into the training of the to-be-tuned error correction model, which reduces mis-corrections caused by insufficient learning of low-frequency samples.
In some optional implementations of this embodiment, the first target position is a correct text position, and performing contrastive learning on the semantic representations of the first target positions of the low-frequency samples in the low-frequency sample set to obtain the first contrastive learning loss includes: comparing a pre-constructed positive sample with the semantic representation of the correct text position to obtain a first positive similarity; comparing a pre-constructed negative sample with the semantic representation of the correct text position to obtain a first negative similarity; and computing the first contrastive learning loss based on the first positive similarity and the first negative similarity.
In this optional implementation, the first target position is a position in the low-frequency sample where a character or word is correct. For example, in a low-frequency sample reading "the weather today is hateful", the position of "today" is a correct text position, while the position of "hateful" is a text error position. As shown in fig. 2, a contrastive learning mechanism over correct text positions is added to the training of the to-be-tuned error correction model, which increases the reliability of training the spelling error correction model.
In this optional implementation, the first positive similarity reflects the similarity between the semantic representation of the correct text position and the positive sample; the larger its value, the more similar they are. The first negative similarity reflects the similarity between the semantic representation of the correct text position and the negative sample; the larger its value, the more similar they are.
In this optional implementation, computing the first contrastive learning loss based on the first positive similarity and the first negative similarity includes: substituting the first positive similarity and the first negative similarity into a contrastive loss formula to obtain the first contrastive learning loss. Specifically, the contrastive loss formula may be as in formula (1):
$$L_1 = -\log \frac{\exp(a_1)}{\exp(a_1) + \sum_{j=1}^{K} \exp(b_{1j})} \tag{1}$$
In formula (1), $a_1$ denotes the first positive similarity and $b_{1j}$ denotes the $j$-th first negative similarity. The negative samples are samples at randomly selected positions: they require no additional retrieval, being drawn directly at random from the low-frequency sample set, and they do not need extra passes through the to-be-tuned error correction model. The contrastive loss formula therefore contains multiple negative samples, and correspondingly multiple first negative similarities; there are K negative samples in formula (1), where K is a natural number greater than 1.
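A sketch of formula (1) in code; how the similarities $a_1$ and $b_{1j}$ are computed (e.g., dot products of the semantic representations) is an assumption, since the text only specifies the loss over one positive similarity and K negative similarities.

```python
import torch

def contrastive_loss(pos_sim: torch.Tensor, neg_sims: torch.Tensor) -> torch.Tensor:
    """Formula (1): -log( exp(a1) / (exp(a1) + sum_j exp(b1j)) ).

    pos_sim:  scalar similarity a1 between the target-position representation
              and its positive sample.
    neg_sims: tensor of K similarities b1j against the negative samples.
    """
    logits = torch.cat([pos_sim.view(1), neg_sims.view(-1)])
    # Index 0 holds the positive; cross-entropy against class 0 equals the loss above.
    return torch.nn.functional.cross_entropy(
        logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```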
The computation of the first contrastive learning loss in this optional implementation combines contrastive learning at correct text positions with low-frequency-sample fine-tuning of the to-be-tuned error correction model, aiming to strengthen the model's understanding of low-frequency words in correct text and to reduce mis-corrections caused by insufficient learning of low-frequency words.
In this optional implementation, the positive sample is constructed in at least one of the following ways: (z1) the semantic representation of the first target position after the input low-frequency sample is truncated; (z2) the semantic representation of the first target position after the input low-frequency sample goes through one additional feedforward pass, exploiting the randomness of the dropout layer in the to-be-tuned error correction model; (z3) the semantic representation of the first target position after an adversarial perturbation value is added to the word vectors of the input low-frequency sample. Here, the input low-frequency sample refers to the low-frequency sample currently input into the to-be-tuned error correction model in the current training iteration.
In this alternative, the positive sample is a sample that the low-frequency sample's representation should be close to. Construction method (z1) is implemented as follows: given a low-frequency sample, select a character x at its first target position and truncate the low-frequency sample (the truncation must not remove the selected x); the truncated text is taken as a new input and encoded with the to-be-tuned error correction model, and the semantic representation of x after this encoding is called the positive sample for the semantic representation of x obtained by encoding the original complete text.
In this optional implementation, the negative sample is constructed in at least one of the following ways: (t1) inputting a sample containing a label easily confused with the true label of the first target position into the to-be-tuned error correction model to obtain the semantic representation at the position of the confusable label; (t2) obtaining semantic representations at random positions of other random samples. Here, the other random samples are entirely different from the low-frequency samples currently input into the to-be-tuned error correction model in the current training iteration.
In this alternative implementation, negative samples are samples that the semantic representation of the text at the selected first target position should stay away from. In negative-sample construction method (t2), it is assumed that a piece of random input text (different from the current input text) is unrelated to the current input text, so the semantic representation at any position of the random sample after model encoding can serve as a negative sample. In practice, since model training proceeds in batches, other samples in the same batch as the current input sample can be selected directly as the random samples.
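The truncation positive (z1) and the in-batch negatives (t2) can be sketched as follows, assuming a model whose encode() method returns one representation vector per input position; the method name, the truncation window, and the batch layout are assumptions.

```python
import torch

def truncation_positive(model, chars, target_pos, window=2):
    """(z1): truncate the input around the selected character x, re-encode,
    and take the representation of x in the truncated text as the positive."""
    start = max(0, target_pos - window)
    truncated = chars[start:]                      # truncation must not remove x itself
    reps = model.encode(truncated)                 # assumed per-position encoder output
    return reps[target_pos - start]

def in_batch_negatives(batch_reps, current_index):
    """(t2): representations at random positions of the *other* samples in the batch."""
    negatives = []
    for i, reps in enumerate(batch_reps):
        if i == current_index:
            continue
        j = torch.randint(len(reps), (1,)).item()  # random position in a different sample
        negatives.append(reps[j])
    return torch.stack(negatives)
```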
In some optional implementations of this embodiment, the error correction sample set includes a pseudo error correction subsample set and a true error correction subsample set, and performing spelling error correction training on the pre-trained text recognition model based on the error correction sample set to obtain the to-be-tuned error correction model includes: performing spelling error correction training on the text recognition model with the pseudo error correction subsample set to obtain an initial error correction model; and performing spelling error correction training on the initial error correction model with the true error correction subsample set to obtain the to-be-tuned error correction model.
In this embodiment, the pseudo error correction subsample set includes at least one pseudo error correction sample, a pseudo error correction sample being constructed by replacing characters with phonetically and/or visually similar ones. The true error correction subsample set includes at least one true error correction sample, a true error correction sample being a real, manually labeled error correction sample. Note that the true error correction subsample set may be a manually labeled subsample set obtained in real time from a third party.
In this optional implementation, the text recognition model has almost no error correction capability, and the initial error correction model obtained by spelling error correction training on the pseudo error correction subsample set still has weak, immature error correction capability. The reason is that the training data is automatically generated rather than real; since real data and automatically generated data differ, a model trained only on automatically generated data cannot fully generalize its capability to real data scenarios.
Further performing spelling error correction training on the initial error correction model with the true error correction subsample set yields a to-be-tuned error correction model that generalizes to real data scenarios, improving its error correction capability.
As shown in fig. 2, the spelling error correction model training process comprises three stages, the first two of which constitute the training of the to-be-tuned error correction model:
the first stage is as follows: in the stage, a pseudo error correction sub-sample set is constructed by utilizing sound/shape approximate replacement, the pre-trained text recognition model is subjected to spelling error correction training, the training target is the real text of each position of the predicted pseudo error correction sample, and finally the initial error correction model is generated for the second stage training.
The second stage: a true error correction subsample set is constructed from manually labeled real error correction pairs, and the initial error correction model obtained in stage 1 undergoes spelling error correction training again, with the same training objective as the first stage and the loss value computed with the same loss function. Because real data is used, after this stage the error correction capability of the initial error correction model on real data is strengthened; the stage produces the to-be-tuned error correction model for the third stage.
The third stage: samples containing low-frequency words are selected from the true error correction subsample set, and the to-be-tuned error correction model obtained in stage 2 is fine-tuned, with the same training objective as the first stage and the loss value computed with the same loss function; the spelling error correction model is finally obtained from this training.
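The three stages chain together as in the following sketch; train_error_correction is a hypothetical helper standing in for the supervised training loop shown earlier, and all the set names are illustrative.

```python
def train_spelling_correction(text_recognition_model,
                              pseudo_set, true_set, low_freq_set):
    # Stage 1: pseudo samples built by phonetic/visual substitution
    initial_model = train_error_correction(text_recognition_model, pseudo_set)
    # Stage 2: manually labeled real error correction pairs, same objective and loss
    to_be_tuned_model = train_error_correction(initial_model, true_set)
    # Stage 3: fine-tune on the low-frequency subset of the true samples
    spelling_correction_model = train_error_correction(to_be_tuned_model, low_freq_set)
    return spelling_correction_model
```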
In the training method for the to-be-tuned error correction model provided by this optional implementation, one stage trains the text recognition model on the pseudo error correction subsample set to obtain the initial error correction model, and another stage trains the initial error correction model on the true error correction subsample set to obtain the to-be-tuned error correction model, improving the error correction capability and generalization capability of the to-be-tuned error correction model.
In some embodiments of the present disclosure, the method for generating a spelling error correction model further includes: during the training of the text recognition model and the initial error correction model, performing contrastive learning on the semantic representation of a second target position in the error correction samples to obtain a second contrastive learning loss; and adjusting the parameters of the text recognition model and the initial error correction model based on the second contrastive learning loss.
In this embodiment, in each training iteration of the text recognition model, after the model encodes the currently input pseudo error correction samples from the pseudo error correction subsample set, in addition to predicting the true text at each position of each pseudo error correction sample, a target position (the position of some character) is randomly selected and a contrastive learning objective is added there: the semantic representation of the position should be close to a preset positive sample and far from a preset negative sample.
In each training iteration of the initial error correction model, after the model encodes the currently input true error correction samples from the true error correction subsample set, in addition to predicting the true text at each position of each true error correction sample, a target position (the position of some character) is randomly selected and a contrastive learning objective is added there: the semantic representation of the position should be close to a preset positive sample and far from a preset negative sample.
The spelling error correction model generation method of this embodiment introduces a contrastive learning mechanism into the training of the text recognition model and the initial error correction model, which improves the models' generalization to spelling errors and reduces missed corrections.
In some optional implementations of this embodiment, performing contrastive learning on the semantic representation of the second target position in the error correction samples to obtain the second contrastive learning loss includes: comparing a pre-constructed positive sample with the semantic representation of the text error position to obtain a second positive similarity; comparing a pre-constructed negative sample with the semantic representation of the text error position to obtain a second negative similarity; and computing the second contrastive learning loss based on the second positive similarity and the second negative similarity.
In this optional implementation, the second target position is a text error position, i.e., a position in the error correction sample set where a character or word is wrong. When the error correction sample set includes a pseudo error correction subsample set and a true error correction subsample set, the second target positions are the positions of wrong characters or words in the pseudo error correction samples and in the true error correction samples, as shown in fig. 2.
In this optional implementation, computing the second contrastive learning loss based on the second positive similarity and the second negative similarity includes: substituting the second positive similarity and the second negative similarity into a contrastive loss formula to obtain the second contrastive learning loss. Specifically, the contrastive loss formula may be as in formula (2):
$$L_2 = -\log \frac{\exp(a_2)}{\exp(a_2) + \sum_{j=1}^{K} \exp(b_{2j})} \tag{2}$$
In formula (2), $a_2$ denotes the second positive similarity and $b_{2j}$ denotes the $j$-th second negative similarity. The negative samples are samples at randomly selected positions, drawn directly at random without additional retrieval, and they do not need extra passes through the text recognition model or the initial error correction model; the contrastive loss formula therefore contains multiple negative samples and correspondingly multiple second negative similarities, K of them in formula (2), where K is a natural number greater than 1.
In the computation of the second contrastive learning loss provided by this optional implementation, contrastive learning over the semantic representations of text error positions serves as an auxiliary task both for pre-training the text recognition model on the pseudo error correction subsample set and for error correction fine-tuning of the initial error correction model on the true error correction subsample set, improving the model's robustness to errors and reducing missed corrections caused by context changes.
In some optional implementations of this embodiment, the pseudo error correction subsample set is obtained by: obtaining an initial text sample set; determining, for a character or word of each text sample in the initial text sample set, a replacement word similar to it in pronunciation or shape; and replacing the characters or words of each text sample in the initial text sample set with the replacement words to obtain the pseudo error correction subsample set.
Optionally, the pseudo error correction subsample set may also be obtained as follows: obtaining an initial text sample set; determining, for a character or word of each text sample in the initial text sample set, a replacement word similar to it in both pronunciation and shape; and replacing the characters or words of each text sample in the initial text sample set with the replacement words to obtain the pseudo error correction subsample set.
In this method of obtaining the pseudo error correction subsample set, after the initial text sample set is obtained, characters or words of the text samples are replaced with replacement words similar to them in pronunciation or shape. This expands the pseudo error correction subsample set to the greatest extent and achieves sample data augmentation.
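A minimal sketch of this construction, assuming a pre-built confusion dictionary mapping each character to its phonetically or visually similar characters; the dictionary contents, replacement probability, and field names are illustrative assumptions.

```python
import random

# Hypothetical confusion sets: character -> phonetically/visually similar characters.
CONFUSION = {"e": ["a", "i"], "o": ["0", "u"]}

def make_pseudo_sample(correct_text: str, replace_prob: float = 0.1) -> dict:
    """Build a pseudo error correction sample by substituting similar characters."""
    noisy = [
        random.choice(CONFUSION[ch]) if ch in CONFUSION and random.random() < replace_prob else ch
        for ch in correct_text
    ]
    return {"source": "".join(noisy), "target": correct_text}
```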
In some optional implementations of this embodiment, the positive sample is constructed in at least one of the following ways: the semantic representation of the second target position after the input error correction sample is truncated; the semantic representation of the second target position after the input error correction sample goes through one additional feedforward pass, exploiting the randomness of the dropout layer in the model; and the semantic representation of the second target position after an adversarial perturbation value is added to the word vectors of the input error correction sample.
In this alternative implementation, the models may be the text recognition model and the initial error correction model. When training the text recognition model, the randomness of its dropout layer can be used to obtain the semantic representation of the second target position after one additional feedforward pass over the input error correction sample; when training the initial error correction model, the randomness of its dropout layer can be used in the same way. In this embodiment, the adversarial perturbation value is an arbitrary value that can be added to the word vectors.
In this optional implementation, the positive sample may be constructed by referring to the positive sample construction for low-frequency samples in the embodiment above.
The positive sample construction methods provided by this optional implementation realize positive samples in multiple ways, improving the diversity of positive sample acquisition.
In some optional implementations of this embodiment, the negative sample is constructed in at least one of the following ways: inputting a sample containing a label easily confused with the true label of the second target position into the model to obtain the semantic representation at the position of the confusable label; and obtaining semantic representations at random positions of other random samples.
In this optional implementation, the negative sample may be constructed by referring to the negative sample construction for low-frequency samples in the embodiment above.
The negative sample construction methods provided by this optional implementation realize negative samples in multiple ways, improving the diversity of negative sample acquisition.
FIG. 3 illustrates a flow chart 300 of one embodiment of the disclosed spell correction method, which includes the steps of:
step 301, text data to be corrected is obtained.
In this embodiment, the text data to be corrected is the text data to be checked; the characters or words in it may be partially or entirely correct. The execution body of the spelling error correction method may acquire the text data to be corrected in various ways. For example, it may obtain text data to be corrected stored in a database server through a wired or wireless connection. As another example, it may receive text data to be corrected collected in real time by a terminal or another device.
In this embodiment, the text data to be corrected may be text data of one text, or may be text data of multiple texts, and the format of the text data to be corrected is not limited in this disclosure.
Step 302, inputting the text data to be corrected into the spelling error correction model generated by the spelling error correction model generation method, and obtaining the error targets in the text data to be corrected and the correction results for those error targets.
In this embodiment, the execution body may input the text data to be corrected acquired in step 301 into the spelling error correction model to obtain the error targets output by the model and the correction results for those targets. An error target is a wrong character or wrong word in the text data to be corrected. When the error target is a wrong character, its correction result is the corresponding correct character; when the error target is a wrong word, its correction result is the corresponding correct word. Optionally, the correction result of an error target may further include position information of the corresponding correct character or word (its coordinates in the text data to be corrected, etc.).
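An inference sketch under the assumption that the trained spelling error correction model exposes a method returning one predicted character per input position and that corrections preserve length; the method name and output format are assumptions, not the patent's API.

```python
def spell_correct(model, text: str):
    """Return (error target, correction, position) records for the input text."""
    predicted = model.correct(text)     # assumed: one predicted character per position
    results = []
    for pos, (orig_ch, pred_ch) in enumerate(zip(text, predicted)):
        if orig_ch != pred_ch:          # an error target and its correction result
            results.append({"error": orig_ch, "correction": pred_ch, "position": pos})
    return results
```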
In this embodiment, the spell correction model may be generated using the method described above in the embodiment of FIG. 1. For a specific generation process, reference may be made to the related description of the embodiment in fig. 1, which is not described herein again.
It should be noted that the spelling error correction method of this embodiment may be used to test the spelling error correction models generated by the foregoing embodiments, and the spelling error correction model can then be continuously optimized according to the error targets and their correction results. The method may also be a practical application of the spelling error correction models generated in the above embodiments. Using these models for spelling error correction helps improve the correctness of the text data to be corrected and the reliability of text editing.
The spelling error correction method of this embodiment acquires text data to be corrected and inputs it into the pre-trained spelling error correction model, effectively identifying the error targets in the text data and their correction results, and improving the efficiency of error target identification.
With further reference to fig. 4, as an implementation of the methods shown in the above-mentioned figures, the present disclosure provides an embodiment of a spell correction model generation apparatus, which corresponds to the embodiment of the method shown in fig. 1, and which is particularly applicable in various electronic devices.
As shown in fig. 4, the spelling error correction model generation apparatus 400 of this embodiment includes: an error correction acquisition unit 401, a to-be-tuned training unit 402, a low-frequency obtaining unit 403, and a spelling training unit 404. The error correction acquisition unit 401 may be configured to acquire an error correction sample set including at least one error correction sample. The to-be-tuned training unit 402 may be configured to perform spelling error correction training on a pre-trained text recognition model based on the error correction sample set to obtain a to-be-tuned error correction model. The low-frequency obtaining unit 403 may be configured to select low-frequency samples including low-frequency words from the error correction sample set to obtain a low-frequency sample set. The spelling training unit 404 may be configured to perform spelling error correction training on the to-be-tuned error correction model based on the low-frequency sample set to obtain the spelling error correction model.
In this embodiment, for the detailed processing and technical effects of the error correction acquisition unit 401, the to-be-tuned training unit 402, the low-frequency obtaining unit 403, and the spelling training unit 404 of the spelling error correction model generation apparatus 400, reference may be made to the descriptions of steps 101, 102, 103, and 104 in the embodiment corresponding to fig. 1, which are not repeated here.
In some optional implementations of this embodiment, the apparatus 400 further includes a first contrastive learning unit (not shown) and a first adjusting unit (not shown). The first contrastive learning unit may be configured to perform contrastive learning on the semantic representations of the first target positions of the low-frequency samples in the low-frequency sample set during training of the to-be-tuned error correction model, obtaining a first contrastive learning loss. The first adjusting unit may be configured to adjust the parameters of the to-be-tuned error correction model based on the first contrastive learning loss.
In some optional implementations of this embodiment, the first target position is a correct text position, and the first contrastive learning unit is further configured to: compare a pre-constructed positive sample with the semantic representation of the correct text position to obtain a first positive similarity; compare a pre-constructed negative sample with the semantic representation of the correct text position to obtain a first negative similarity; and compute the first contrastive learning loss based on the first positive similarity and the first negative similarity.
In some optional implementations of this embodiment, the error correction sample set includes a pseudo error correction subsample set and a true error correction subsample set, and the to-be-tuned training unit 402 is further configured to: perform spelling error correction training on the text recognition model with the pseudo error correction subsample set to obtain an initial error correction model; and perform spelling error correction training on the initial error correction model with the true error correction subsample set to obtain the to-be-tuned error correction model.
In some optional implementations of this embodiment, the apparatus 400 further includes a second contrastive learning unit (not shown) and a second adjusting unit (not shown). The second contrastive learning unit may be configured to perform contrastive learning on the semantic representation of the second target position in the error correction samples during training of the text recognition model and the initial error correction model, obtaining a second contrastive learning loss. The second adjusting unit may be configured to adjust the parameters of the text recognition model and the initial error correction model based on the second contrastive learning loss.
In some optional implementations of this embodiment, the second contrastive learning unit is further configured to: compare a pre-constructed positive sample with the semantic representation of the text error position to obtain a second positive similarity; compare a pre-constructed negative sample with the semantic representation of the text error position to obtain a second negative similarity; and compute the second contrastive learning loss based on the second positive similarity and the second negative similarity.
In some optional implementations of this embodiment, the pseudo error correction subsample set is obtained by a sample aggregation unit (not shown), which may be configured to: obtain an initial text sample set; determine, for a character or word of each text sample in the initial text sample set, a replacement word similar to it in pronunciation or shape; and replace the characters or words of each text sample in the initial text sample set with the replacement words to obtain the pseudo error correction subsample set.
In some optional implementations of this embodiment, the positive sample is constructed by at least one of the following units: a truncation unit (not shown), a feedforward unit (not shown), and an adding unit (not shown). The truncation unit may be configured to obtain the semantic representation of the second target position after the input error correction sample is truncated. The feedforward unit may be configured to obtain the semantic representation of the second target position after one additional feedforward pass over the input error correction sample, exploiting the randomness of the dropout layer in the model. The adding unit may be configured to obtain the semantic representation of the second target position after an adversarial perturbation value is added to the word vectors of the input error correction sample.
In some optional implementations of this embodiment, the negative sample is constructed by at least one of the following units: an input unit (not shown) and a random obtaining unit (not shown). The input unit may be configured to input a sample containing a label easily confused with the true label of the second target position into the model, obtaining the semantic representation at the position of the confusable label. The random obtaining unit may be configured to obtain semantic representations at random positions of other random samples.
First, the error correction acquisition unit 401 acquires an error correction sample set including at least one error correction sample; second, the to-be-tuned training unit 402 performs spelling error correction training on a pre-trained text recognition model based on the error correction sample set to obtain a to-be-tuned error correction model; third, the low-frequency obtaining unit 403 selects low-frequency samples including low-frequency words from the error correction sample set to obtain a low-frequency sample set; finally, the spelling training unit 404 performs spelling error correction training on the to-be-tuned error correction model based on the low-frequency sample set to obtain the spelling error correction model. Fine-tuning the to-be-tuned error correction model with the low-frequency words in the error correction sample set thus improves the spelling error correction model's understanding of low-frequency words, reduces mis-corrections, and improves both the generalization of the spelling error correction model and its performance on the spelling error correction task.
With continuing reference to FIG. 5, as an implementation of the method shown in FIG. 3 above, the present application provides an embodiment of a spelling error correction apparatus. This apparatus embodiment corresponds to the method embodiment shown in FIG. 3, and the apparatus can be applied to various electronic devices.
As shown in FIG. 5, the spelling error correction apparatus 500 of this embodiment may include: a text acquisition unit 501 configured to acquire text data to be corrected; and a result obtaining unit 502 configured to input the text data to be corrected into the spelling error correction model generated by the apparatus described in the embodiment of FIG. 4, obtaining the error target in the text data to be corrected and the correction result of the error target.
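A hypothetical inference wrapper corresponding to the text acquisition unit 501 and the result obtaining unit 502 is sketched below. It assumes a HuggingFace-style tokenizer and a model whose per-position logits score replacement characters directly, which is one common realization of non-autoregressive spelling error correction but is not mandated by the disclosure.

```python
import torch

def correct_text(model, tokenizer, text):
    """Encode the text to be corrected, let the model predict a character
    for every position, and report each position where the prediction
    differs from the input (the error target) with its correction."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits              # (1, seq_len, vocab)
    pred_ids = logits.argmax(dim=-1)[0].tolist()
    orig_ids = inputs["input_ids"][0].tolist()
    preds = tokenizer.convert_ids_to_tokens(pred_ids)
    origs = tokenizer.convert_ids_to_tokens(orig_ids)
    return [(i, o, p) for i, (o, p) in enumerate(zip(origs, preds)) if o != p]
```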
It will be understood that the units recited in the apparatus 500 correspond to the respective steps of the method described with reference to FIG. 3. Accordingly, the operations, features, and advantages described above for the method also apply to the apparatus 500 and the units contained therein, and are not repeated here.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other handling of the personal information of the users involved all comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 6 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in FIG. 6, the device 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. The RAM 603 can also store various programs and data required for the operation of the device 600. The computing unit 601, the ROM 602, and the RAM 603 are connected to one another via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 601 may be any of various general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 601 performs the methods and processes described above, such as the spelling error correction model generation method and the spelling error correction method. For example, in some embodiments, these methods may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the spelling error correction model generation method and the spelling error correction method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform these methods by any other suitable means (for example, by means of firmware).
Various implementations of the systems and techniques described herein above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, a special purpose computer, or another programmable apparatus (such as the spelling error correction model generation apparatus or the spelling error correction apparatus), such that the program code, when executed by the processor or controller, causes the functions/acts specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (23)

1. A method of generating a spell correction model, the method comprising:
obtaining an error correction sample set comprising at least one error correction sample;
performing spelling error correction training on a pre-trained text recognition model based on the error correction sample set to obtain an error correction model to be adjusted;
selecting a low-frequency sample comprising low-frequency words from the error correction sample set to obtain a low-frequency sample set;
and performing spelling error correction training on the error correction model to be adjusted based on the low-frequency sample set to obtain a spelling error correction model.
2. The method of claim 1, further comprising:
in the training process of the error correction model to be adjusted, performing contrastive learning on semantic representations of first target positions of the low-frequency samples in the low-frequency sample set to obtain a first contrastive learning loss;
and adjusting parameters of the error correction model to be adjusted based on the first contrastive learning loss.
3. The method of claim 2, wherein the first target position is a text correct position, and the performing contrastive learning on the semantic representations of the first target positions of the low-frequency samples in the low-frequency sample set to obtain a first contrastive learning loss comprises:
comparing the similarity between a pre-constructed positive sample and the semantic representation of the text correct position to obtain a first positive similarity;
comparing the similarity between a pre-constructed negative sample and the semantic representation of the text correct position to obtain a first negative similarity;
and calculating the first contrastive learning loss based on the first positive similarity and the first negative similarity.
4. The method of claim 1, wherein the error correction sample set comprises a pseudo error correction subsample set and a true error correction subsample set, and the performing spelling error correction training on the pre-trained text recognition model based on the error correction sample set to obtain an error correction model to be adjusted comprises:
performing spelling error correction training on the text recognition model using the pseudo error correction subsample set to obtain an initial error correction model;
and performing spelling error correction training on the initial error correction model using the true error correction subsample set to obtain the error correction model to be adjusted.
5. The method of claim 4, further comprising:
in the training process of the text recognition model and the initial error correction model, performing contrastive learning on the semantic representation of a second target position in an error correction sample to obtain a second contrastive learning loss;
and adjusting parameters of the text recognition model and the initial error correction model based on the second contrastive learning loss.
6. The method of claim 5, wherein the second target position is a text error position, and the performing contrastive learning on the semantic representation of the second target position in the error correction sample to obtain a second contrastive learning loss comprises:
comparing the similarity between a pre-constructed positive sample and the semantic representation of the text error position to obtain a second positive similarity;
comparing the similarity between a pre-constructed negative sample and the semantic representation of the text error position to obtain a second negative similarity;
and calculating the second contrastive learning loss based on the second positive similarity and the second negative similarity.
7. The method of claim 4, wherein the pseudo error correction subsample set is obtained by:
obtaining an initial text sample set;
determining, for each text sample in the initial text sample set, a replacement character or word that is similar in pronunciation or shape to a character or word in the text sample;
and replacing the character or word in each text sample in the initial text sample set with the replacement to obtain the pseudo error correction subsample set.
8. The method of claim 6, wherein the positive sample is constructed in at least one of the following ways: truncating the input error correction sample and then obtaining the semantic representation of the second target position; performing one additional feedforward pass on the input error correction sample using the randomness of a dropout layer in the model and then obtaining the semantic representation of the second target position; and adding an adversarial perturbation to the word vector of the input error correction sample and then obtaining the semantic representation of the second target position.
9. The method of claim 6, wherein the negative sample is constructed in at least one of the following ways: inputting, into the model, a sample in which the true label of the second target position is replaced by a confusable label, and obtaining the semantic representation of the position of the confusable label; and obtaining semantic representations of random positions of other random samples.
10. A method of spell correction, the method comprising:
acquiring text data to be corrected;
inputting the text data to be corrected into a spelling error correction model generated by the method of any one of claims 1 to 9, and obtaining the error target in the text data to be corrected and the correction result of the error target.
11. An apparatus for generating a spell correction model, the apparatus comprising:
an error correction acquisition unit configured to acquire an error correction sample set including at least one error correction sample;
a training unit to be adjusted configured to perform spelling error correction training on a pre-trained text recognition model based on the error correction sample set to obtain an error correction model to be adjusted;
a low frequency obtaining unit configured to select a low frequency sample including low frequency words from the error correction sample set, resulting in a low frequency sample set;
and a spelling training unit configured to perform spelling error correction training on the error correction model to be adjusted based on the low-frequency sample set to obtain a spelling error correction model.
12. The apparatus of claim 11, further comprising:
a first contrastive learning unit configured to perform contrastive learning on semantic representations of first target positions of low-frequency samples in the low-frequency sample set in the training process of the error correction model to be adjusted to obtain a first contrastive learning loss;
and a first adjusting unit configured to adjust parameters of the error correction model to be adjusted based on the first contrastive learning loss.
13. The apparatus of claim 12, wherein the first target position is a text correct position, and the first contrastive learning unit is further configured to: compare the similarity between a pre-constructed positive sample and the semantic representation of the text correct position to obtain a first positive similarity; compare the similarity between a pre-constructed negative sample and the semantic representation of the text correct position to obtain a first negative similarity; and calculate the first contrastive learning loss based on the first positive similarity and the first negative similarity.
14. The apparatus of claim 12, wherein the error correction sample set comprises a pseudo error correction subsample set and a true error correction subsample set, and the training unit to be adjusted is further configured to: perform spelling error correction training on the text recognition model using the pseudo error correction subsample set to obtain an initial error correction model; and perform spelling error correction training on the initial error correction model using the true error correction subsample set to obtain the error correction model to be adjusted.
15. The apparatus of claim 14, further comprising:
a second contrastive learning unit configured to perform contrastive learning on the semantic representation of a second target position in an error correction sample in the training process of the text recognition model and the initial error correction model to obtain a second contrastive learning loss;
and a second adjusting unit configured to adjust parameters of the text recognition model and the initial error correction model based on the second contrastive learning loss.
16. The apparatus of claim 15, wherein the second contrastive learning unit is further configured to: compare the similarity between a pre-constructed positive sample and the semantic representation of the text error position to obtain a second positive similarity; compare the similarity between a pre-constructed negative sample and the semantic representation of the text error position to obtain a second negative similarity; and calculate the second contrastive learning loss based on the second positive similarity and the second negative similarity.
17. The apparatus of claim 14, wherein the pseudo error correction subsample set is obtained using a sample aggregation unit configured to: obtain an initial text sample set; determine, for each text sample in the initial text sample set, a replacement character or word that is similar in pronunciation or shape to a character or word in the text sample; and replace the character or word in each text sample with the replacement to obtain the pseudo error correction subsample set.
18. The apparatus of claim 16, wherein the positive sample is constructed by at least one of the following units: a truncation unit configured to truncate the input error correction sample and then obtain the semantic representation of the second target position; a feedforward unit configured to perform one additional feedforward pass on the input error correction sample using the randomness of a dropout layer in the model and then obtain the semantic representation of the second target position; and an adding unit configured to add an adversarial perturbation to the word vector of the input error correction sample and then obtain the semantic representation of the second target position.
19. The apparatus of claim 16, wherein the negative sample is constructed by at least one of the following units: an input unit configured to input, into the model, a sample in which the true label of the second target position is replaced by a confusable label, and obtain the semantic representation of the position of the confusable label; and a random acquisition unit configured to obtain semantic representations of random positions of other random samples.
20. A spelling error correction apparatus, said apparatus comprising:
a text acquisition unit configured to acquire text data to be corrected;
a result obtaining unit configured to input the text data to be corrected into a spelling error correction model generated by the apparatus according to any one of claims 11-19, and obtain an error target in the text data to be corrected and a correction result of the error target.
21. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.
22. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-10.
23. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-10.
CN202210546618.2A 2022-05-18 2022-05-18 Spelling error correction model generation method and device, and spelling error correction method and device Active CN114861637B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210546618.2A CN114861637B (en) 2022-05-18 2022-05-18 Spelling error correction model generation method and device, and spelling error correction method and device


Publications (2)

Publication Number Publication Date
CN114861637A true CN114861637A (en) 2022-08-05
CN114861637B CN114861637B (en) 2023-06-16

Family

ID=82638684

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210546618.2A Active CN114861637B (en) 2022-05-18 2022-05-18 Spelling error correction model generation method and device, and spelling error correction method and device

Country Status (1)

Country Link
CN (1) CN114861637B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112926306A (en) * 2021-03-08 2021-06-08 北京百度网讯科技有限公司 Text error correction method, device, equipment and storage medium
CN113627158A (en) * 2021-07-02 2021-11-09 南京理工大学 Chinese spelling error correction method and device based on multiple characteristics and multiple pre-training models
CN113837370A (en) * 2021-10-20 2021-12-24 北京房江湖科技有限公司 Method and apparatus for training a model based on contrast learning
CN113947072A (en) * 2021-10-15 2022-01-18 上海水滴征信服务有限公司 Text error correction method and text error correction device
CN114387602A (en) * 2022-03-24 2022-04-22 北京智源人工智能研究院 Medical OCR data optimization model training method, optimization method and equipment


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114997148A (en) * 2022-08-08 2022-09-02 湖南工商大学 Chinese spelling proofreading pre-training model construction method based on contrast learning
CN116306598A (en) * 2023-05-22 2023-06-23 上海蜜度信息技术有限公司 Customized error correction method, system, equipment and medium for words in different fields
CN116306598B (en) * 2023-05-22 2023-09-08 上海蜜度信息技术有限公司 Customized error correction method, system, equipment and medium for words in different fields

Also Published As

Publication number Publication date
CN114861637B (en) 2023-06-16

Similar Documents

Publication Publication Date Title
US20210397780A1 (en) Method, device, and storage medium for correcting error in text
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN113129870B (en) Training method, device, equipment and storage medium of speech recognition model
CN112528655B (en) Keyword generation method, device, equipment and storage medium
CN114861637A (en) Method and device for generating spelling error correction model and method and device for spelling error correction
CN113407698B (en) Method and device for training and recognizing intention of intention recognition model
CN116152833B (en) Training method of form restoration model based on image and form restoration method
CN112560846B (en) Error correction corpus generation method and device and electronic equipment
CN112559885A (en) Method and device for determining training model of map interest point and electronic equipment
CN114724168A (en) Training method of deep learning model, text recognition method, text recognition device and text recognition equipment
JP2023025126A (en) Training method and apparatus for deep learning model, text data processing method and apparatus, electronic device, storage medium, and computer program
CN115358243A (en) Training method, device, equipment and storage medium for multi-round dialogue recognition model
CN115359323A (en) Image text information generation method and deep learning model training method
CN114048733A (en) Training method of text error correction model, and text error correction method and device
CN114492426A (en) Sub-word segmentation method, model training method, device and electronic equipment
CN114973279B (en) Training method and device for handwritten text image generation model and storage medium
CN116662484A (en) Text regularization method, device, equipment and storage medium
CN114490969B (en) Question and answer method and device based on table and electronic equipment
CN115565186A (en) Method and device for training character recognition model, electronic equipment and storage medium
CN115631502A (en) Character recognition method, character recognition device, model training method, electronic device and medium
CN114898742A (en) Method, device, equipment and storage medium for training streaming voice recognition model
CN114817476A (en) Language model training method and device, electronic equipment and storage medium
CN113553833A (en) Text error correction method and device and electronic equipment
CN112541557A (en) Training method and device of generative confrontation network and electronic equipment
CN116244432B (en) Pre-training method and device for language model and electronic equipment

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant