CN114861637B - Spelling error correction model generation method and device, and spelling error correction method and device

Publication number
CN114861637B
Authority
CN
China
Prior art keywords
error correction
sample
text
model
spelling
Prior art date
Legal status
Active
Application number
CN202210546618.2A
Other languages
Chinese (zh)
Other versions
CN114861637A (en)
Inventor
马芸
桂睿
曹宇慧
黄硕
陈永锋
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210546618.2A
Publication of CN114861637A
Application granted
Publication of CN114861637B

Classifications

    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F40/216 Parsing using statistical methods
    • G06F40/30 Semantic analysis

Abstract

The disclosure provides a spelling error correction model generation method and device, relates to the technical field of artificial intelligence, in particular to the technical fields of deep learning, natural language processing, and the like, and can be applied in scenarios such as OCR. The specific implementation scheme is as follows: obtaining an error correction sample set comprising at least one error correction sample; performing spelling error correction training on a pre-trained text recognition model based on the error correction sample set to obtain a to-be-adjusted error correction model; selecting low-frequency samples comprising low-frequency vocabulary from the error correction sample set to obtain a low-frequency sample set; and performing spelling error correction training on the to-be-adjusted error correction model based on the low-frequency sample set to obtain the spelling error correction model. This embodiment improves the generalization ability of the spelling error correction model to spelling errors.

Description

Spelling error correction model generation method and device, and spelling error correction method and device
Technical Field
The present disclosure relates to the field of artificial intelligence, in particular to the fields of deep learning, natural language processing, and the like, and may be applied in scenarios such as OCR; more particularly, it relates to a spelling error correction model generation method and apparatus, a spelling error correction method and apparatus, an electronic device, a computer readable medium, and a computer program product.
Background
A spelling error correction system aims to automatically recognize misspelled words in text and give corresponding modification suggestions based on natural language processing techniques. Traditional spelling error correction systems mostly adopt a technical route combining rule matching with a ranking model: candidates are recalled by rule matching based on dictionary resources and edit distance, and the recalled candidates are scored by a ranking model after feature extraction to form an error correction result. This traditional technique of combining rule matching with a ranking model relies excessively on dictionary resources and feature engineering, incurs high labor cost, and lacks generalization capability.
Disclosure of Invention
A spelling error correction model generation method and apparatus, a spelling error correction method and apparatus, an electronic device, a computer readable medium, and a computer program product are provided.
According to a first aspect, there is provided a spelling error correction model generation method, the method comprising: obtaining an error correction sample set comprising at least one error correction sample; based on the error correction sample set, performing spelling error correction training on the pre-trained text recognition model to obtain an error correction model to be adjusted; selecting a low-frequency sample comprising a low-frequency vocabulary from the error correction sample set to obtain a low-frequency sample set; and performing spelling error correction training on the error correction model to be adjusted based on the low-frequency sample set to obtain the spelling error correction model.
According to a second aspect, there is provided a spelling error correction method, the method comprising: acquiring text data to be corrected; and inputting the text data to be corrected into a spelling error correction model generated by the method described in any implementation manner of the first aspect, to obtain an error target in the text data to be corrected and a correction result of the error target.
According to a third aspect, there is provided a spelling error correction model generating apparatus, comprising: an error correction acquisition unit configured to acquire an error correction sample set including at least one error correction sample; the training unit to be adjusted is configured to perform spelling error correction training on the pre-trained text recognition model based on the error correction sample set to obtain an error correction model to be adjusted; a low-frequency acquisition unit configured to select a low-frequency sample including a low-frequency vocabulary from the error correction sample set to obtain a low-frequency sample set; and the spelling training unit is configured to perform spelling error correction training on the error correction model to be adjusted based on the low-frequency sample set to obtain the spelling error correction model.
According to a fourth aspect, there is provided a spelling error correction apparatus, comprising: a text acquisition unit configured to acquire text data to be corrected; the obtaining unit is configured to input the text data to be corrected into the spelling correction model generated by the device described in any implementation manner of the third aspect, so as to obtain the error target in the text data to be corrected and a correction result of the error target.
According to a fifth aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in any one of the implementations of the first or second aspect.
According to a sixth aspect, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform a method as described in any implementation of the first or second aspect.
According to a seventh aspect, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as described in any of the implementations of the first or second aspects.
The embodiments of the disclosure provide a spelling error correction model generation method and apparatus. First, an error correction sample set comprising at least one error correction sample is obtained; second, spelling error correction training is performed on a pre-trained text recognition model based on the error correction sample set to obtain a to-be-adjusted error correction model; third, low-frequency samples comprising low-frequency vocabulary are selected from the error correction sample set to obtain a low-frequency sample set; finally, spelling error correction training is performed on the to-be-adjusted error correction model based on the low-frequency sample set to obtain the spelling error correction model. Fine-tuning the to-be-adjusted error correction model with the low-frequency vocabulary in the error correction sample set thus improves the spelling error correction model's understanding of low-frequency vocabulary, reduces miscorrection, and improves both the generalization of the spelling error correction model and its performance on spelling error correction tasks.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of one embodiment of a spelling error correction model generation method according to the present disclosure;
FIG. 2 is a schematic diagram of a structure for spelling error correction model generation in an embodiment of the present disclosure;
FIG. 3 is a flow chart of one embodiment of a spelling error correction method according to the present disclosure;
FIG. 4 is a schematic diagram of an embodiment of a spelling error correction model generation apparatus according to the present disclosure;
FIG. 5 is a schematic diagram of an embodiment of a spelling error correction device according to the present disclosure;
FIG. 6 is a block diagram of an electronic device for implementing the spelling error correction model generation method and the spelling error correction method of embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In this embodiment, "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature.
FIG. 1 illustrates a flow 100 of one embodiment of a spelling error correction model generation method according to the present disclosure, which includes the steps of:
Step 101, an error correction sample set comprising at least one error correction sample is obtained.
In this embodiment, the error correction sample set is a text data set acquired by an execution body on which the spelling error correction model generation method is run in order to train the spelling error correction model. The execution body of the spelling error correction model generation method can obtain the error correction sample set in various ways. For example, the execution body may acquire the error correction sample set stored therein from the database server through a wired connection or a wireless connection. For another example, the executing body may also receive an error correction sample set collected in real time by the terminal or other device.
In this embodiment, the error correction sample set includes at least one error correction sample. Each error correction sample may be a piece of text data in which some wrongly written characters are marked with corresponding character labels, the character labels being the correct characters corresponding to the wrongly written characters; optionally, some misused words in the text data are labeled with corresponding word tags, the word tags being the correct words corresponding to the misused words.
Optionally, the error correction sample set includes at least one piece of text data, each piece of text data including: an original text and a text label corresponding to the original text, the text label being the correct text corresponding to the original text.
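For concreteness, one way such a labeled error correction sample could be represented is sketched below; the storage format, field names, and example text are illustrative assumptions, not prescribed by the disclosure.

```python
# An illustrative labeled error correction sample: the original text plus a
# mapping from the position of a wrongly written word to its correct word.
error_correction_sample = {
    "text": "the weather is very good todey",
    "labels": {25: "today"},  # character offset of the misspelled word -> correct word
}
```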
Step 102, performing spelling error correction training on the pre-trained text recognition model based on the error correction sample set to obtain a to-be-adjusted error correction model.
In this embodiment, the pre-trained text recognition model is a model obtained by training a masked language model, and the text recognition model is used for predicting the content in the text.
The training process of the pre-trained text recognition model is as follows: on large-scale unlabeled text data, a portion of the characters in the text data are randomly replaced with special characters (the special characters are recognized by the text recognition model as masks for the original characters); the replaced text data and the original data are input into the text recognition network corresponding to the text recognition model; the network encodes the replaced text data to produce a prediction of the replaced content; and the parameters of the text recognition network are adjusted based on the prediction and the original data, until the number of training iterations reaches a training threshold or the loss value reaches a loss threshold, yielding the text recognition model. Given any masked text as input, the resulting model can predict the original characters at the special-character positions.
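As a hedged sketch of this masking step (the mask token and masking ratio below are assumptions; the disclosure only says "a part of characters" is replaced), the random replacement could look like:

```python
import random

MASK_TOKEN = "[MASK]"  # assumed special character; the disclosure does not name one
MASK_RATIO = 0.15      # assumed ratio; the disclosure only says "a part of characters"

def mask_text(chars: list[str]) -> tuple[list[str], list[int]]:
    """Randomly replace a portion of characters with the mask token.

    Returns the masked sequence and the masked indices, so that the training
    loss can be computed against the original characters at those positions.
    """
    masked = list(chars)
    n_mask = max(1, int(len(chars) * MASK_RATIO))
    chosen = random.sample(range(len(chars)), n_mask)
    for i in chosen:
        masked[i] = MASK_TOKEN
    return masked, chosen

original = list("the weather is very good today")
masked, chosen = mask_text(original)
# The text recognition network is trained to predict original[i] at each i in chosen.
```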
In this embodiment, the network structure of the text recognition model may employ ERNIE (Enhanced Representation through Knowledge Integration, a knowledge-enhanced semantic representation model) and other Transformer-based bidirectional models, such as BERT (Bidirectional Encoder Representations from Transformers) and ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately), and the like.
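For instance, a Transformer-based masked language model of this kind can be obtained with the Hugging Face transformers library; the specific checkpoint below is an illustrative assumption, not one prescribed by the disclosure.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# "bert-base-chinese" is an illustrative checkpoint, not one named by the disclosure.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForMaskedLM.from_pretrained("bert-base-chinese")
```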
In this embodiment, the pre-trained text recognition model does not yet have any error correction capability: after text data with mask occlusion is input, the text recognition model can only predict the characters of the masked portions of the input text data.
In this embodiment, spelling error correction training is performed on the text recognition model with the error correction sample set; the resulting to-be-adjusted error correction model has a certain error correction capability, but that capability is not yet mature and remains weak.
Step 103, selecting low-frequency samples comprising low-frequency vocabulary from the error correction sample set to obtain a low-frequency sample set.
In this embodiment, the low-frequency samples are error correction samples that account for a small proportion of the error correction sample set: the low-frequency vocabulary they contain appears only rarely across all error correction samples in the set. Because such vocabulary is rarely seen, the model easily mistakes the position of a low-frequency word for an error, leading to miscorrection. The low-frequency samples are therefore selected to form a low-frequency sample set dedicated to error correction training of the to-be-adjusted error correction model, which improves the model's recognition of low-frequency samples.
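A minimal sketch of this selection step, assuming word frequencies are counted over the whole sample set and that "low-frequency" is defined by a count threshold (the threshold and tokenization are assumptions):

```python
from collections import Counter

def select_low_frequency_samples(samples: list[list[str]], threshold: int = 5):
    """Return the samples containing at least one low-frequency word.

    `samples` holds tokenized error correction samples; a word counts as
    low-frequency if it occurs fewer than `threshold` times across the set.
    """
    freq = Counter(word for sample in samples for word in sample)
    low_freq_words = {w for w, c in freq.items() if c < threshold}
    return [s for s in samples if any(w in low_freq_words for w in s)]
```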
Step 104, performing spelling error correction training on the to-be-adjusted error correction model based on the low-frequency sample set to obtain the spelling error correction model.
In this embodiment, the training of the to-be-adjusted error correction model includes the following steps: step one, select a low-frequency sample from the low-frequency sample set; step two, input the selected low-frequency sample into the to-be-adjusted error correction model, so that the model encodes the selected low-frequency sample and predicts the real text at each text position of the selected low-frequency sample; step three, calculate the loss value of the to-be-adjusted error correction model based on the text it predicted and the selected low-frequency sample; step four, if the to-be-adjusted error correction model does not meet the training completion condition, adjust its parameters and continue executing steps one to four, until the model meets the training completion condition, then take the to-be-adjusted error correction model as the spelling error correction model. In this embodiment, the training completion condition includes: the loss value of the to-be-adjusted error correction model reaches a certain loss threshold, or the number of training iterations reaches a preset number, where the number of training iterations refers to the number of times steps one to four have been executed.
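The four steps above can be pictured with the following hedged PyTorch-style sketch; the model interface (`input_ids`, `target_ids`), loss function, and thresholds are illustrative assumptions — the disclosure only requires predicting the real text at each position and checking a loss or iteration threshold.

```python
import random
import torch
import torch.nn.functional as F

def train_to_be_adjusted_model(model, low_freq_samples, optimizer,
                               loss_threshold=0.05, max_iters=10_000):
    """Steps one to four: select a sample, predict, compute loss, adjust parameters."""
    for _ in range(max_iters):                              # iteration-count completion condition
        sample = random.choice(low_freq_samples)            # step one: select a low-frequency sample
        logits = model(sample.input_ids)                    # step two: encode and predict each position
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                               sample.target_ids.view(-1))  # step three: loss vs. the real text
        optimizer.zero_grad()
        loss.backward()                                     # step four: adjust parameters
        optimizer.step()
        if loss.item() < loss_threshold:                    # loss-threshold completion condition
            break
    return model  # now serves as the spelling error correction model
```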
In this embodiment, training of the to-be-adjusted error correction model continues on selected low-frequency samples containing low-frequency vocabulary, so that the trained spelling error correction model better understands the semantics of low-frequency vocabulary, reducing the spelling error correction model's miscorrection.
Optionally, the spelling error correction model generation method may further include: selecting error correction samples that are prone to miscorrection from the error correction sample set to obtain a miscorrection-prone sample set; and training the spelling error correction model with the miscorrection-prone sample set to obtain a final error correction model. In this embodiment, the miscorrection-prone samples are sample types that easily induce miscorrection, such as error correction samples containing proper names (e.g., personal names, place names, etc.).
According to the spelling error correction model generation method provided by the embodiments of the disclosure, first, an error correction sample set comprising at least one error correction sample is obtained; second, spelling error correction training is performed on a pre-trained text recognition model based on the error correction sample set to obtain a to-be-adjusted error correction model; third, low-frequency samples comprising low-frequency vocabulary are selected from the error correction sample set to obtain a low-frequency sample set; finally, spelling error correction training is performed on the to-be-adjusted error correction model based on the low-frequency sample set to obtain the spelling error correction model. Fine-tuning the to-be-adjusted error correction model with the low-frequency vocabulary in the error correction sample set thus improves the spelling error correction model's understanding of low-frequency vocabulary, reduces miscorrection, and improves both the generalization of the spelling error correction model and its performance on spelling error correction tasks.
In some embodiments of the present disclosure, the foregoing spelling error correction model generating method further includes: in the training process of the error correction model to be adjusted, carrying out contrast learning on semantic representation of a first target position of a low-frequency sample in a low-frequency sample set to obtain a first contrast learning loss; and adjusting parameters of the error correction model to be adjusted based on the first contrast learning loss.
In this embodiment, "in the training process of the to-be-adjusted error correction model" means that, in each iterative training pass, after the to-be-adjusted error correction model encodes the low-frequency sample, a target position (the position of some character) is randomly selected in addition to predicting the real text at each position of the low-frequency sample; a contrastive learning objective is added at that target position, the objective being that the semantic representation of the position inside the to-be-adjusted error correction model is close to a positive sample constructed for it and far from a negative sample constructed for it.
In this embodiment, the semantic representation may be taken from the output of the last layer of the to-be-adjusted error correction model; for example, when the to-be-adjusted error correction model adopts an ERNIE encoder, the semantic representation is the output of the last layer of the ERNIE encoder.
In this embodiment, the low-frequency sample in the low-frequency sample set is the low-frequency sample selected in the current iterative training pass of the to-be-adjusted error correction model; the first target position of the low-frequency sample is the position of a piece of text (such as a text-correct position, where the character or word is correct, or a text-error position, where the character or word is wrong), and the semantic representation of the first target position is the vector representation of that text in the last layer of the to-be-adjusted error correction model. The semantic representation of the first target position is compared with a pre-constructed positive sample and negative sample to determine the first contrast learning loss; in this comparison, the semantic representation of the first target position should be maximally close to the positive sample and far from the negative sample, and when this objective reaches its optimum, the prediction result of the to-be-adjusted error correction model is determined to be optimal.
According to the spelling error correction model generation method of this embodiment, a contrastive learning mechanism is introduced in the training of the to-be-adjusted error correction model, which can reduce miscorrection caused by the model's insufficient learning of low-frequency samples.
In some optional implementations of this embodiment, the first target position is a text-correct position, and performing contrastive learning on the semantic representation of the first target position of the low-frequency sample in the low-frequency sample set to obtain the first contrast learning loss includes: comparing the semantic representation of the text-correct position with a pre-constructed positive sample to obtain a first positive similarity; comparing the semantic representation of the text-correct position with a pre-constructed negative sample to obtain a first negative similarity; and calculating the first contrast learning loss based on the first positive similarity and the first negative similarity.
In this optional implementation, the first target position is a position where the text is correct, i.e., where the character or word in the low-frequency sample is correct; for example, if the low-frequency sample reads "the weather is very good todey", the positions of the correctly written words such as "weather" are text-correct positions, while the position of the misspelled "todey" is a text-error position. As shown in FIG. 2, a text-correct-position contrastive learning mechanism is added in the training process of the to-be-adjusted error correction model, which can improve the training reliability of the spelling error correction model.
In this optional implementation manner, the first positive similarity is used to reflect the similarity between the semantic representation of the correct position of the text and the positive sample, and the larger the value of the first positive similarity is, the more similar the semantic representation of the correct position of the text and the positive sample is; the first negative similarity is used for reflecting the similarity between the semantic representation of the correct position of the text and the negative sample, and the larger the value of the first negative similarity is, the more similar the semantic representation of the correct position of the text and the negative sample are.
In this optional implementation, calculating the first contrast learning loss based on the first positive similarity and the first negative similarity includes: substituting the first positive similarity and the first negative similarity into a contrastive loss calculation formula to obtain the first contrast learning loss. Specifically, the contrastive loss calculation formula may adopt the following formula (1):
$$\mathcal{L}_{1} = -\log\frac{\exp(a_{1})}{\exp(a_{1}) + \sum_{j=1}^{K}\exp(b_{1j})}\qquad(1)$$
In formula (1), a_1 denotes the first positive similarity and b_{1j} denotes the first negative similarities. Because the negative samples are samples at randomly selected positions, they do not need to be searched for separately: they are selected directly at random from the low-frequency sample set and do not need to pass through the to-be-adjusted error correction model again. There are therefore multiple first negative similarities in the contrastive loss calculation formula, corresponding to the K negative samples in formula (1), where K is a natural number greater than 1.
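A hedged PyTorch sketch of formula (1) follows; the cosine similarity measure and the temperature-free form are assumptions, since the disclosure only fixes one positive similarity and K negative similarities.

```python
import torch
import torch.nn.functional as F

def first_contrast_learning_loss(anchor, positive, negatives):
    """Formula (1): -log( exp(a1) / (exp(a1) + sum_j exp(b1j)) ).

    anchor:    (d,) semantic representation of the first target position
    positive:  (d,) representation of the pre-constructed positive sample
    negatives: (K, d) representations of the K random negative samples
    """
    a1 = F.cosine_similarity(anchor, positive, dim=0)          # first positive similarity
    b1 = F.cosine_similarity(anchor.unsqueeze(0), negatives)   # K first negative similarities
    logits = torch.cat([a1.unsqueeze(0), b1])
    return -torch.log_softmax(logits, dim=0)[0]                # evaluates formula (1)
```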
According to the first contrast learning loss calculation method provided by this alternative implementation, text-correct-position contrastive learning is combined with low-frequency-sample fine-tuning of the to-be-adjusted error correction model, which strengthens the error correction model's understanding of low-frequency vocabulary in correct text and reduces miscorrection caused by insufficient learning of low-frequency vocabulary.
In this alternative implementation, the positive sample is constructed in at least one of the following ways: (z1) the semantic representation of the first target position after the input low-frequency sample is truncated; (z2) the semantic representation of the first target position after an additional feedforward pass over the input low-frequency sample, exploiting the randomness of the dropout layer in the to-be-adjusted error correction model; (z3) the semantic representation of the first target position after an adversarial perturbation value is added to the word vectors of the input low-frequency sample. In this optional implementation, the input low-frequency sample refers to the low-frequency sample currently in the to-be-adjusted error correction model in the current iterative training round.
In this alternative implementation, the positive sample is a sample that the low-frequency sample's representation should be close to. Construction (z1) is implemented as follows: given a low-frequency sample, select the character x at its first target position and truncate the low-frequency sample (the selected x must not be cut off); the truncated text is taken as a new input and encoded by the to-be-adjusted error correction model, and the semantic representation corresponding to the encoded x is called a positive sample of the semantic representation corresponding to x in the encoded original complete text.
In this alternative implementation, the negative sample is constructed in at least one of the following ways: (t1) inputting a sample containing a label confusable with the real label of the first target position into the to-be-adjusted error correction model to obtain the semantic representation at the position of the confusable label; (t2) obtaining semantic representations at random positions of other random samples. In this alternative implementation, the other random samples are samples completely different from the low-frequency sample currently input into the to-be-adjusted error correction model in the current iterative training round.
In this alternative implementation, the negative sample is a sample from which the semantic representation of the text at the selected first target position should be far away. The assumption behind the negative sample construction method (t1) is that a random piece of input text (different from the current input text) should be uncorrelated with the current input text, so any labeled semantic representation of the random sample after model encoding can serve as a negative sample. In practice, since model training proceeds in batches, other samples in the same batch as the current input sample can be directly selected as the random samples.
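As an illustration of the dropout-based positive construction (z2) and the in-batch random negative construction (t2) — the other constructions follow the same pattern — here is a hedged sketch; the batch layout and encoder interface are assumptions.

```python
import torch

def build_contrastive_pairs(encoder, batch_input_ids, target_pos):
    """Dropout-based positives (z2) and in-batch random negatives (t2).

    The encoder must be in training mode so dropout is active; two forward
    passes over the same input then yield two different representations.
    `target_pos` is a (B,) tensor of first target positions, one per sample.
    """
    encoder.train()
    h1 = encoder(batch_input_ids)                         # (B, L, d) first pass
    h2 = encoder(batch_input_ids)                         # (B, L, d) second pass, new dropout mask
    idx = torch.arange(h1.size(0))
    anchors = h1[idx, target_pos]                         # representations at the target positions
    positives = h2[idx, target_pos]                       # (z2): same position, extra feedforward pass
    rand_pos = torch.randint(0, h1.size(1), (h1.size(0),))
    negatives = h1.roll(shifts=1, dims=0)[idx, rand_pos]  # (t2): random positions of other in-batch samples
    return anchors, positives, negatives
```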
In some optional implementations of this embodiment, the error correction sample set includes a pseudo error correction sub-sample set and a true error correction sub-sample set, and performing spelling error correction training on the pre-trained text recognition model based on the error correction sample set to obtain the to-be-adjusted error correction model comprises: performing spelling error correction training on the text recognition model with the pseudo error correction sub-sample set to obtain an initial error correction model; and performing spelling error correction training on the initial error correction model with the true error correction sub-sample set to obtain the to-be-adjusted error correction model.
In this embodiment, the pseudo error correction sub-sample set includes at least one pseudo error correction sample, where a pseudo error correction sample is constructed by replacing characters in a text with characters similar in pronunciation and/or shape. The true error correction sub-sample set includes at least one true error correction sample, where a true error correction sample is a manually labeled real error correction sample. It should be noted that the true error correction sub-sample set may be a sample set obtained in real time from a third party and labeled manually.
In this alternative implementation, the text recognition model has almost no error correction capability, and the initial error correction model obtained by performing spelling error correction training on it with the pseudo error correction sub-sample set still has weak, immature error correction capability. The reason for this weakness is that the training data are automatically generated rather than real; since real data and automatically generated data differ, a model trained only on automatically generated data cannot fully generalize to real data scenarios.
Further, spelling error correction training is performed on the initial error correction model with the true error correction sub-sample set; the resulting to-be-adjusted error correction model can fully generalize to real data scenarios, improving its error correction capability.
As shown in fig. 2, the spelling error correction model training process includes three stages, of which the first two constitute the training process of the to-be-adjusted error correction model:
the first stage: the training method comprises the steps of constructing a pseudo error correction sub-sample set by utilizing sound/shape near substitution, performing spelling error correction training on a pre-trained text recognition model, wherein the training target is the real text of each position of a predicted pseudo error correction sample, and finally generating an initial error correction model for second-stage training.
The second stage: a true error correction sub-sample set is constructed from manually labeled real error correction pairs, and spelling error correction training is performed again on the initial error correction model obtained in stage 1, with the loss value calculated using the same loss function as in the first stage and the same training target as in the first stage. Because real data are used, this stage strengthens the initial error correction model's error correction capability on real data, and it finally generates the to-be-adjusted error correction model used in the third stage.
And a third stage: the method comprises the steps of selecting samples containing low-frequency words in a true error correction sub-sample set, fine-adjusting the to-be-adjusted identification model obtained in the step 2, calculating a loss value by adopting a loss function identical to that of the first step, and finally training to obtain a spelling error correction model, wherein a training target is identical to that of the first step.
According to the training method for the to-be-adjusted error correction model provided by this alternative implementation, the text recognition model is trained with the pseudo error correction sub-sample set in one stage to obtain the initial error correction model, and the initial error correction model is trained with the true error correction sub-sample set in the other stage to obtain the to-be-adjusted error correction model, improving the error correction capability and generalization capability of the to-be-adjusted error correction model.
In some embodiments of the present disclosure, the foregoing spelling error correction model generation method further includes: in the training process of the text recognition model and the initial error correction model, performing contrastive learning on the semantic representation of a second target position in the error correction sample to obtain a second contrast learning loss; and adjusting parameters of the text recognition model and the initial error correction model based on the second contrast learning loss.
In this embodiment, in each iterative training pass of the text recognition model, after the text recognition model encodes the pseudo error correction samples of the currently input pseudo error correction sub-sample set, a target position (the position of some character) is randomly selected in addition to predicting the real text at each position of the pseudo error correction samples; a contrastive learning objective is added at that target position, the objective being that the semantic representation of the position in the model is close to a positive sample constructed for it and far from a negative sample constructed for it.
Likewise, in each iterative training pass of the initial error correction model, after the initial error correction model encodes the true error correction samples of the currently input true error correction sub-sample set, a target position (the position of some character) is randomly selected in addition to predicting the real text at each position of the true error correction samples; a contrastive learning objective is added at that target position, the objective being that the semantic representation of the position in the model is close to a positive sample constructed for it and far from a negative sample constructed for it.
According to the spelling error correction model generation method of this embodiment, a contrastive learning mechanism is introduced in the training of the text recognition model and the initial error correction model, which can improve the models' generalization to spelling errors and reduce the missed-correction phenomenon.
In some optional implementations of this embodiment, the second target position is a text-error position, and performing contrastive learning on the semantic representation of the second target position in the error correction sample to obtain the second contrast learning loss includes: comparing the semantic representation of the text-error position with a pre-constructed positive sample to obtain a second positive similarity; comparing the semantic representation of the text-error position with a pre-constructed negative sample to obtain a second negative similarity; and calculating the second contrast learning loss based on the second positive similarity and the second negative similarity.
In this optional implementation, the second target position is a text-error position, i.e., the position of a wrong character or word in an error correction sample in the error correction sample set. Since the error correction sample set includes the pseudo and true error correction sub-sample sets, the second target positions are the positions of character or word errors in the pseudo error correction samples and in the true error correction samples, as shown in fig. 2.
In this optional implementation, calculating the second contrast learning loss based on the second positive similarity and the second negative similarity includes: substituting the second positive similarity and the second negative similarity into the contrastive loss calculation formula to obtain the second contrast learning loss. Specifically, the contrastive loss calculation formula may adopt the following formula (2):
$$\mathcal{L}_{2} = -\log\frac{\exp(a_{2})}{\exp(a_{2}) + \sum_{j=1}^{K}\exp(b_{2j})}\qquad(2)$$
In formula (2), a_2 denotes the second positive similarity and b_{2j} denotes the second negative similarities. Because the negative samples are samples at randomly selected positions, they do not need to be searched for separately: they are selected directly at random from the sample set and do not need to pass through the text recognition model or the initial error correction model again. There are therefore multiple negative samples in the contrastive loss calculation formula and correspondingly multiple second negative similarities, namely the K negative samples in formula (2), where K is a natural number greater than 1.
According to the second contrast learning loss calculation method provided by this alternative implementation, contrastive learning on the semantic representation of text-error positions serves as an auxiliary task both when pre-training the text recognition model with the pseudo error correction sub-sample set and when fine-tuning the initial error correction model with the true error correction sub-sample set, improving the model's robustness to errors and reducing missed corrections caused by context changes.
In some optional implementations of this embodiment, the pseudo error correction sub-sample set is obtained as follows: acquiring an initial text sample set; determining, for the characters or words of each text sample in the initial text sample set, replacement words that are similar in pronunciation or shape; and replacing the characters or words of each text sample in the initial text sample set with the replacement words to obtain the pseudo error correction sub-sample set.
Optionally, the pseudo error correction sub-sample set may also be obtained as follows: acquiring an initial text sample set; determining, for the characters or words of each text sample in the initial text sample set, replacement words that are similar in both pronunciation and shape; and replacing the characters or words of each text sample in the initial text sample set with the replacement words to obtain the pseudo error correction sub-sample set.
According to the method for obtaining the pseudo error correction sub-sample set provided by this alternative implementation, after the initial text sample set is acquired, the characters or words of the text samples are replaced with replacement words similar in pronunciation or shape, which expands the pseudo error correction sub-sample set to the maximum extent and achieves a data augmentation effect.
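A hedged sketch of this pseudo-sample construction follows; the tiny confusion sets below are illustrative stand-ins, since the disclosure does not specify the phonetic/graphic confusion resources.

```python
import random

# Tiny illustrative confusion sets; a real system would use large resources of
# pronunciation-similar ("sound-near") and shape-similar ("shape-near") words.
SOUND_NEAR = {"their": ["there", "they're"], "to": ["too", "two"]}
SHAPE_NEAR = {"form": ["from"], "trail": ["trial"]}

def make_pseudo_sample(tokens: list[str], replace_prob: float = 0.1):
    """Replace some words with sound-near or shape-near words; return (noisy, clean)."""
    noisy = list(tokens)
    for i, word in enumerate(tokens):
        candidates = SOUND_NEAR.get(word, []) + SHAPE_NEAR.get(word, [])
        if candidates and random.random() < replace_prob:
            noisy[i] = random.choice(candidates)
    return noisy, tokens  # the clean tokens serve as the correction labels
```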
In some alternative implementations of this embodiment, the positive sample is constructed in at least one of the following ways: the semantic representation of the second target position after the input error correction sample is truncated; the semantic representation of the second target position after an additional feedforward pass over the input error correction sample, exploiting the randomness of the dropout layer in the model; and the semantic representation of the second target position after an adversarial perturbation value is added to the word vectors of the input error correction sample.
In this alternative implementation, the models may be the text recognition model and the initial error correction model. When the text recognition model is trained, the randomness of the dropout layer in the text recognition model can be exploited to obtain the semantic representation of the second target position after an additional feedforward pass over the input error correction sample; when the initial error correction model is trained, the randomness of the dropout layer in the initial error correction model can be exploited in the same way. In this embodiment, the adversarial perturbation value is an arbitrary value that can be added to the word vector.
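A hedged sketch of the perturbation-based positive construction follows; the random-noise choice of perturbation and the `encode_from_embeddings` hook are assumptions — the disclosure states only that the perturbation may be an arbitrary value added to the word vector.

```python
import torch

def perturbed_representation(encoder, embeddings, target_pos, epsilon=1e-2):
    """Positive sample via a perturbation added to the input word vectors.

    embeddings: (L, d) word vectors of the input error correction sample.
    """
    delta = epsilon * torch.randn_like(embeddings)               # arbitrary small added value
    hidden = encoder.encode_from_embeddings(embeddings + delta)  # assumed encoder hook
    return hidden[target_pos]                                    # representation at the second target position
```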
In this alternative implementation manner, the positive sample configuration manner may refer to the positive sample configuration manner corresponding to the low-frequency sample in the foregoing embodiment.
The positive sample construction method provided by the alternative implementation mode realizes positive samples in various modes, and improves the diversity of positive sample acquisition.
In some alternative implementations of this embodiment, the negative sample is constructed in at least one of the following ways: inputting a sample containing a label confusable with the real label of the second target position into the model to obtain the semantic representation at the position of the confusable label; and obtaining semantic representations at random positions of other random samples.
In this alternative implementation manner, the negative sample may be configured by referring to the negative sample configuration manner corresponding to the low frequency sample in the foregoing embodiment.
The negative sample construction method provided by the alternative implementation mode adopts various modes to realize the negative samples, and improves the diversity of the negative sample acquisition.
FIG. 3 illustrates a flow 300 of one embodiment of a spelling error correction method of the present disclosure, which includes the steps of:
in step 301, text data to be corrected is acquired.
In this embodiment, the text data to be corrected is text data to be detected, and the text or the words in the text data to be corrected may be partially correct or may be completely correct. The execution body of the spelling error correction method can acquire text data to be error corrected in a variety of ways. For example, the execution subject may acquire text data to be corrected stored therein from the database server through a wired connection or a wireless connection. For another example, the execution subject may also receive text data to be corrected collected in real time by the terminal or other devices.
In this embodiment, the text data to be corrected may be text data of a piece of text, text data of a plurality of pieces of text, or the like, and the format of the text data to be corrected is not limited in this disclosure.
Step 302, inputting the text data to be corrected into the spelling error correction model generated by the spelling error correction model generation method, to obtain the error target in the text data to be corrected and the correction result of the error target.
In this embodiment, the execution body may input the text data to be corrected acquired in step 301 into the spelling error correction model, so as to obtain the error target output by the spelling error correction model and the correction result of the error target. The error target is a wrong character or a wrong word in the text data to be corrected; when the error target is a wrong character, the correction result is the correct character corresponding to it, and when the error target is a wrong word, the correction result is the correct word corresponding to it. Optionally, the correction result of the error target may further include position information of the correct character or word corresponding to the error target (such as its coordinates in the text data to be corrected).
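A hedged usage sketch of this inference step follows; the model wrapper and return format are assumptions, as the disclosure only specifies that the model returns error targets and their corrections, optionally with position information.

```python
def correct_spelling(spelling_correction_model, text: str):
    """Return (error target, correction, position) triples for `text`."""
    predictions = spelling_correction_model.predict(text)  # assumed wrapper API
    results = []
    for pos, (observed, predicted) in enumerate(zip(text, predictions)):
        if observed != predicted:   # the model predicts a different character: an error target
            results.append((observed, predicted, pos))
    return results

# Usage: correct_spelling(model, "the weather is very good todey")
# would yield the wrong characters, their corrections, and their coordinates.
```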
In this embodiment, the spelling error correction model may be generated using the method described above with respect to the embodiment of FIG. 1. For the specific generation process, reference may be made to the relevant description of the embodiment of FIG. 1, which is not repeated here.
It should be noted that the spelling error correction method of this embodiment may be used to test the spelling error correction model generated in the above embodiments, and the spelling error correction model can be continuously optimized according to the error target and the correction result of the error target. The method may also be a practical application method of the spelling error correction model generated in the above embodiments. Performing spelling error correction with the spelling error correction model generated by the above embodiments improves the correctness of the text data to be corrected and the reliability of text editing.
According to the spelling error correction method, the text data to be corrected is obtained, the text data to be corrected is input into the pre-trained spelling error correction model, the error targets in the text data to be corrected and the correction results of the error targets can be effectively identified, and the identification efficiency of the error targets is improved.
With further reference to fig. 4, as an implementation of the method illustrated in the above figures, the present disclosure provides an embodiment of a spelling error correction model generating apparatus, which corresponds to the method embodiment illustrated in fig. 1, and which is particularly applicable to various electronic devices.
As shown in fig. 4, the spelling error correction model generating apparatus 400 provided in this embodiment includes: an error correction acquisition unit 401, a to-be-adjusted training unit 402, a low frequency acquisition unit 403, and a spelling training unit 404. The error correction acquisition unit 401 may be configured to acquire an error correction sample set comprising at least one error correction sample. The to-be-adjusted training unit 402 may be configured to perform spelling error correction training on the pre-trained text recognition model based on the error correction sample set to obtain the to-be-adjusted error correction model. The low frequency acquisition unit 403 may be configured to select low-frequency samples comprising low-frequency vocabulary from the error correction sample set to obtain a low-frequency sample set. The spelling training unit 404 may be configured to perform spelling error correction training on the to-be-adjusted error correction model based on the low-frequency sample set to obtain the spelling error correction model.
In the present embodiment, in the spelling error correction model generation device 400: the specific processing of the error correction obtaining unit 401, the training unit to be adjusted 402, the low frequency obtaining unit 403, and the spelling training unit 404 and the technical effects thereof may refer to the relevant descriptions of step 101, step 102, step 103, and step 104 in the corresponding embodiment of fig. 1, and are not described herein.
In some optional implementations of this embodiment, the apparatus 400 further includes: a first contrast learning unit (not shown in the figure), and a first adjustment unit (not shown in the figure). The first contrast learning unit may be configured to perform contrast learning on semantic representation of a first target position of a low-frequency sample in the low-frequency sample set in a training process of the error correction model to be adjusted, so as to obtain a first contrast learning loss. The first adjustment unit may be configured to adjust parameters of the error correction model to be adjusted based on the first contrast learning loss.
In some optional implementations of this embodiment, the first target location is a text correct location, and the first contrast learning unit is further configured to: performing similarity comparison by adopting a pre-constructed positive sample and semantic representation of the correct position of the text to obtain a first positive similarity; performing similarity comparison by adopting a pre-constructed negative sample and semantic representation of the correct position of the text to obtain a first negative similarity; and calculating to obtain a first contrast learning loss based on the first positive similarity and the first negative similarity.
In some optional implementations of this embodiment, the error correction sample set includes a pseudo error correction sub-sample set and a true error correction sub-sample set, and the to-be-adjusted training unit 402 is further configured to: perform spelling error correction training on the text recognition model with the pseudo error correction sub-sample set to obtain an initial error correction model; and perform spelling error correction training on the initial error correction model with the true error correction sub-sample set to obtain the to-be-adjusted error correction model.
In some optional implementations of this embodiment, the apparatus 400 further includes: a second contrast learning unit (not shown in the figure), and a second adjusting unit (not shown in the figure). The second comparison learning unit may be configured to perform comparison learning on the semantic representation of the second target position in the error correction sample in the training process of the text recognition model and the initial error correction model, so as to obtain a second comparison learning loss. The second adjustment unit may be configured to adjust parameters of the text recognition model and the initial correction model based on the second contrast learning loss.
In some optional implementations of this embodiment, the second contrast learning unit is further configured to: performing similarity comparison by adopting a pre-constructed positive sample and semantic representation of a text error position to obtain a second positive similarity; performing similarity comparison by adopting a pre-constructed negative sample and semantic representation of a text error position to obtain a second negative similarity; and calculating a second contrast learning loss based on the second positive similarity and the second negative similarity.
In some optional implementations of the present embodiment, the above-mentioned pseudo error correction sub-sample set is obtained by using a sample aggregation unit (not shown in the figure); the sample aggregation unit may be configured to: acquiring an initial text sample set; determining a word or word sound or shape near replacement words of each text sample in the initial text sample set; and replacing the characters or words of each text sample in the initial text sample set by using the replacement words to obtain a pseudo error correction sub-sample set.
In some alternative implementations of this embodiment, the positive sample is obtained by at least one of the following unit configurations: a truncation unit (not shown in the figure), a feedforward unit (not shown in the figure), and an adding unit (not shown in the figure). The truncation unit may be configured to obtain the semantic representation of the second target position after the input error correction sample is truncated. The feedforward unit may be configured to obtain the semantic representation of the second target position after an additional feedforward pass over the input error correction sample, exploiting the randomness of the dropout layer in the model. The adding unit may be configured to obtain the semantic representation of the second target position after an adversarial perturbation value is added to the word vectors of the input error correction sample.
In some alternative implementations of this embodiment, the negative sample is obtained by at least one of the following cell configurations: an input unit (not shown in the figure) and a random acquisition unit (not shown in the figure), wherein the input unit may be configured to obtain a semantic representation of the position of the confusing label after inputting a sample of the confusing label containing the true label of the second target position into the model. The random acquisition unit may be configured to acquire semantic representations of random positions of other random samples.
According to the spelling error correction model generation apparatus provided by the embodiments of the present disclosure, first, the error correction acquisition unit 401 acquires an error correction sample set including at least one error correction sample; second, the to-be-adjusted training unit 402 performs spelling error correction training on the pre-trained text recognition model based on the error correction sample set to obtain the to-be-adjusted error correction model; third, the low frequency acquisition unit 403 selects low-frequency samples including low-frequency vocabulary from the error correction sample set to obtain a low-frequency sample set; finally, the spelling training unit 404 performs spelling error correction training on the to-be-adjusted error correction model based on the low-frequency sample set to obtain the spelling error correction model. Fine-tuning the to-be-adjusted error correction model with the low-frequency vocabulary in the error correction sample set improves the spelling error correction model's understanding of low-frequency vocabulary, reduces miscorrection, and improves both the generalization of the spelling error correction model and its performance on spelling error correction tasks.
With continued reference to FIG. 5, as an implementation of the method shown in FIG. 3 above, the present application provides an embodiment of a spelling error correction device. This device embodiment corresponds to the method embodiment shown in FIG. 3, and the device can be applied to various electronic devices.
As shown in FIG. 5, the spelling error correction device 500 of this embodiment may include: a text acquisition unit 501 configured to acquire text data to be corrected; and a result obtaining unit 502 configured to input the text data to be corrected into the spelling error correction model generated by the apparatus described in the embodiment of FIG. 4, to obtain the error target in the text data to be corrected and the correction result of that error target.
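As an illustrative sketch of such inference (assuming a HuggingFace-style token-level model that predicts a character per input position; the patent does not prescribe this interface):

```python
import torch

def correct_text(model, tokenizer, text):
    """Return (position, original_token, corrected_token) error targets."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits        # (1, seq_len, vocab_size)
    pred_ids = logits.argmax(dim=-1)[0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    preds = tokenizer.convert_ids_to_tokens(pred_ids)
    # an error target is any position where the prediction differs from the input
    return [(i, t, p) for i, (t, p) in enumerate(zip(tokens, preds)) if t != p]
```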
It will be appreciated that the units described in the apparatus 500 correspond to the respective steps of the method described with reference to FIG. 3. Thus, the operations, features, and benefits described above with respect to the method are equally applicable to the apparatus 500 and the units contained therein, and are not repeated here.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other handling of users' personal information comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
FIG. 6 illustrates a schematic block diagram of an example electronic device 600 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in FIG. 6, the device 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 may also be stored. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other by a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Various components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, mouse, etc.; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 601 may be any of various general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the various methods and processes described above, such as the spelling error correction model generation method and the spelling error correction method. For example, in some embodiments, the spelling error correction model generation method and the spelling error correction method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the spelling error correction model generation method and the spelling error correction method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the spelling error correction model generation method and the spelling error correction method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable spelling error correction model generation device or spelling error correction device, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed herein can be achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (18)

1. A spelling error correction model generation method, the method comprising:
obtaining an error correction sample set comprising at least one error correction sample;
performing spelling error correction training on the pre-trained text recognition model based on the error correction sample set to obtain an error correction model to be adjusted;
selecting a low-frequency sample comprising a low-frequency vocabulary from the error correction sample set to obtain a low-frequency sample set;
performing spelling error correction training on the error correction model to be adjusted based on the low-frequency sample set to obtain a spelling error correction model; wherein, during the training of the error correction model to be adjusted, contrast learning is performed on the semantic representation of a first target position of a low-frequency sample in the low-frequency sample set to obtain a first contrast learning loss;
based on the first contrast learning loss, adjusting parameters of the error correction model to be adjusted;
wherein the first target position is a position where the text is correct, and the performing contrast learning on the semantic representation of the first target position of the low-frequency sample in the low-frequency sample set to obtain the first contrast learning loss comprises:
comparing the similarity between a pre-constructed positive sample and the semantic representation of the text correct position to obtain a first positive similarity;
comparing the similarity between a pre-constructed negative sample and the semantic representation of the text correct position to obtain a first negative similarity; and
calculating the first contrast learning loss based on the first positive similarity and the first negative similarity.
2. The method of claim 1, wherein the error correction sample set comprises: a pseudo error correction sub-sample set and a true error correction sub-sample set, and the performing spelling error correction training on the pre-trained text recognition model based on the error correction sample set to obtain the error correction model to be adjusted comprises:
performing spelling error correction training on the text recognition model by adopting the pseudo error correction sub-sample set to obtain an initial error correction model;
and performing spelling error correction training on the initial error correction model by adopting the true error correction sub-sample set to obtain a to-be-adjusted error correction model.
3. The method of claim 2, the method further comprising:
in the training process of the text recognition model and the initial error correction model, performing contrast learning on semantic characterization of a second target position in an error correction sample to obtain a second contrast learning loss;
and adjusting parameters of the text recognition model and the initial error correction model based on the second contrast learning loss.
4. The method of claim 3, wherein the second target position is a position where the text is erroneous, and the performing contrast learning on the semantic representation of the second target position in the error correction sample to obtain the second contrast learning loss comprises:
comparing the similarity between a pre-constructed positive sample and the semantic representation of the text error position to obtain a second positive similarity;
comparing the similarity between a pre-constructed negative sample and the semantic representation of the text error position to obtain a second negative similarity; and
calculating the second contrast learning loss based on the second positive similarity and the second negative similarity.
5. The method of claim 2, wherein the pseudo error correction sub-sample set is obtained as follows:
acquiring an initial text sample set;
determining, for the characters or words of each text sample in the initial text sample set, replacement words that are similar in pronunciation or shape;
and replacing those characters or words with the replacement words to obtain the pseudo error correction sub-sample set.
6. The method of claim 4, wherein the positive sample is constructed by at least one of: taking the semantic representation of the second target position after truncating the input error correction sample; using the randomness of a dropout layer in the model to perform an additional feedforward pass on the input error correction sample and taking the semantic representation of the second target position; and adding an adversarial perturbation to the word vector of the input error correction sample and taking the semantic representation of the second target position.
7. The method of claim 4, wherein the negative sample is constructed by at least one of: inputting into the model a sample in which the true label of the second target position is replaced by a confusable label, and obtaining the semantic representation of the confusable label's position; and obtaining semantic representations of random positions of other random samples.
8. A method of spelling error correction, the method comprising:
acquiring text data to be corrected;
inputting the text data to be corrected into a spelling correction model generated by adopting the method of any one of claims 1-7, and obtaining the error target in the text data to be corrected and the correction result of the error target.
9. A spelling error correction model generation device, the device comprising:
an error correction acquisition unit configured to acquire an error correction sample set including at least one error correction sample;
a to-be-adjusted training unit configured to perform spelling error correction training on a pre-trained text recognition model based on the error correction sample set to obtain a to-be-adjusted error correction model;
a low-frequency acquisition unit configured to select a low-frequency sample including a low-frequency vocabulary from the error correction sample set to obtain a low-frequency sample set;
a spelling training unit configured to perform spelling error correction training on the to-be-adjusted error correction model based on the low-frequency sample set to obtain a spelling error correction model; and a first contrast learning unit configured to perform contrast learning on the semantic representation of a first target position of a low-frequency sample in the low-frequency sample set during the training of the to-be-adjusted error correction model, to obtain a first contrast learning loss;
a first adjustment unit configured to adjust parameters of the error correction model to be adjusted based on the first contrast learning loss;
wherein the first target position is a position where the text is correct, and the first contrast learning unit is further configured to: compare the similarity between a pre-constructed positive sample and the semantic representation of the text correct position to obtain a first positive similarity; compare the similarity between a pre-constructed negative sample and the semantic representation of the text correct position to obtain a first negative similarity; and calculate the first contrast learning loss based on the first positive similarity and the first negative similarity.
10. The apparatus of claim 9, wherein the error correction sample set comprises: a pseudo error correction sub-sample set and a true error correction sub-sample set, and the to-be-adjusted training unit is further configured to: perform spelling error correction training on the text recognition model using the pseudo error correction sub-sample set to obtain an initial error correction model; and perform spelling error correction training on the initial error correction model using the true error correction sub-sample set to obtain the to-be-adjusted error correction model.
11. The apparatus of claim 10, the apparatus further comprising:
the second contrast learning unit is configured to conduct contrast learning on semantic characterization of a second target position in the error correction sample in the training process of the text recognition model and the initial error correction model, so as to obtain a second contrast learning loss;
and a second adjustment unit configured to adjust parameters of the text recognition model and the initial correction model based on the second contrast learning loss.
12. The apparatus of claim 11, wherein the second target position is a position where the text is erroneous, and the second contrast learning unit is further configured to: compare the similarity between a pre-constructed positive sample and the semantic representation of the text error position to obtain a second positive similarity; compare the similarity between a pre-constructed negative sample and the semantic representation of the text error position to obtain a second negative similarity; and calculate the second contrast learning loss based on the second positive similarity and the second negative similarity.
13. The apparatus of claim 10, wherein the pseudo error correction sub-sample set is obtained with a sample aggregation unit configured to: acquire an initial text sample set; determine, for the characters or words of each text sample in the initial text sample set, replacement words that are similar in pronunciation or shape; and replace those characters or words with the replacement words to obtain the pseudo error correction sub-sample set.
14. The apparatus of claim 12, wherein the positive sample is obtained by at least one of the following unit configurations: a truncation unit configured to truncate the input error correction sample and take the semantic representation of the second target position; a feedforward unit configured to use the randomness of the dropout layer in the model to perform an additional feedforward pass on the input error correction sample and take the resulting semantic representation of the second target position; and an adding unit configured to add an adversarial perturbation to the word vector of the input error correction sample and take the semantic representation of the second target position.
15. The apparatus of claim 12, wherein the negative sample is obtained by at least one of the following unit configurations: an input unit configured to input into the model a sample in which the true label of the second target position is replaced by a confusable label, and to obtain the semantic representation of the confusable label's position; and a random acquisition unit configured to acquire semantic representations of random positions of other random samples.
16. A spelling error correction device, the device comprising:
a text acquisition unit configured to acquire text data to be corrected;
a result obtaining unit configured to input the text data to be corrected into a spelling correction model generated by using the apparatus of any one of claims 9 to 15, and obtain an error target in the text data to be corrected and a correction result of the error target.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-8.
CN202210546618.2A 2022-05-18 2022-05-18 Spelling error correction model generation method and device, and spelling error correction method and device Active CN114861637B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210546618.2A CN114861637B (en) 2022-05-18 2022-05-18 Spelling error correction model generation method and device, and spelling error correction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210546618.2A CN114861637B (en) 2022-05-18 2022-05-18 Spelling error correction model generation method and device, and spelling error correction method and device

Publications (2)

Publication Number Publication Date
CN114861637A (en) 2022-08-05
CN114861637B (en) 2023-06-16

Family

ID=82638684

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210546618.2A Active CN114861637B (en) 2022-05-18 2022-05-18 Spelling error correction model generation method and device, and spelling error correction method and device

Country Status (1)

Country Link
CN (1) CN114861637B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114997148B (en) * 2022-08-08 2022-11-04 湖南工商大学 Chinese spelling proofreading pre-training model construction method based on contrast learning
CN116306598B (en) * 2023-05-22 2023-09-08 上海蜜度信息技术有限公司 Customized error correction method, system, equipment and medium for words in different fields

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113627158A (en) * 2021-07-02 2021-11-09 南京理工大学 Chinese spelling error correction method and device based on multiple characteristics and multiple pre-training models

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112926306B (en) * 2021-03-08 2024-01-23 北京百度网讯科技有限公司 Text error correction method, device, equipment and storage medium
CN113947072A (en) * 2021-10-15 2022-01-18 上海水滴征信服务有限公司 Text error correction method and text error correction device
CN113837370B (en) * 2021-10-20 2023-12-05 贝壳找房(北京)科技有限公司 Method and apparatus for training a model based on contrast learning
CN114387602B (en) * 2022-03-24 2022-07-08 北京智源人工智能研究院 Medical OCR data optimization model training method, optimization method and equipment

Also Published As

Publication number Publication date
CN114861637A (en) 2022-08-05


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant