CN112287670A - Text error correction method, system, computer device and readable storage medium - Google Patents

Text error correction method, system, computer device and readable storage medium

Info

Publication number
CN112287670A
Authority
CN
China
Prior art keywords
training
text
error correction
masked
soft
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011293207.4A
Other languages
Chinese (zh)
Inventor
陈倩倩
景艳山
郑悦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Mininglamp Software System Co ltd
Original Assignee
Beijing Mininglamp Software System Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Mininglamp Software System Co ltd filed Critical Beijing Mininglamp Software System Co ltd
Priority to CN202011293207.4A
Publication of CN112287670A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to a text error correction method, system, computer device and computer-readable storage medium, wherein the method comprises the following steps: a data acquisition step for acquiring text data to be corrected; a negative sample construction step for creating a confusion word table and performing corpus replacement on the text to be corrected according to the confusion word table to generate negative samples; a text error correction step in which the text data to be corrected and the negative sample data are used as training data, the Chinese character features and the pinyin features of the training data are each trained by a Soft-Masked BERT pre-training model, the two outputs are concatenated into a training result, and cross-entropy loss is computed over the training result through a Softmax layer to obtain the error correction result; and a model optimization step for optimizing the Soft-Masked BERT pre-training model through recursive prediction and vocabulary filtering. The method and device effectively improve text error correction accuracy and improve the effect and performance of the model.

Description

Text error correction method, system, computer device and readable storage medium
Technical Field
The present application relates to the field of natural language processing, and more particularly, to text error correction methods, systems, computer devices, and computer readable storage media.
Background
Chinese error correction is an important technology for automatically checking and correcting Chinese sentences; its goal is to improve language correctness and reduce the cost of manual proofreading. Chinese character error correction mainly corrects errors according to the similarity of character forms. In the field of natural language processing it is a technique for detecting whether a passage of text contains wrongly written characters and correcting them, and it is generally used in the text preprocessing stage. It can significantly alleviate inaccurate information acquisition in scenarios such as intelligent customer service; for example, in some intelligent question-answering scenarios, wrongly written characters affect query understanding and the quality of the dialogue. In the general domain, Chinese text correction is a problem that has been pursued since the early days of the Internet. In a search engine, a good error correction system can prompt corrections for the query terms entered by a user or directly display the correct answer.
In the prior art, text correction is realized with the following tools: (1) constructing a dictionary of wrongly written characters; (2) edit distance, which uses a method similar to fuzzy string matching and can correct some common wrongly written characters and ill-formed expressions by comparison against correct samples; (3) language models, whose error correction granularity may be the character or the word. In recent years pre-trained language models have become popular, and researchers have migrated BERT (Bidirectional Encoder Representations from Transformers) models into text correction, using BERT at each position of a sentence to select a character from a candidate list as the correction.
Among these tools, constructing a dictionary of wrongly written characters has a high labor cost and is only suitable for a limited number of vertical domains; the edit-distance approach lacks generality; and in language models, because the semantic information at character granularity is relatively weak, the misjudgment rate is higher than with word-granularity correction, while word granularity depends on the accuracy of the word segmentation model. To reduce the misjudgment rate, a CRF layer is often added to the output layer of the model for proofreading, which learns transition probabilities and a globally optimal path to avoid outputting unreasonable corrections. The BERT approach is too coarse and easily leads to a high false-correction rate: BERT's mask positions are randomly selected, so the model is not good at detecting the positions of errors within a sentence, and BERT correction imposes no constraints, so its accuracy is low.
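For illustration only, and not as part of the claimed method, the edit-distance comparison mentioned in item (2) above can be sketched as follows; this is a minimal Python sketch assuming a plain Levenshtein distance and a hypothetical list of correct samples:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def correct_by_samples(phrase: str, correct_samples: list[str], max_distance: int = 1) -> str:
    """Replace the phrase with the closest correct sample if it is within max_distance."""
    best = min(correct_samples, key=lambda s: levenshtein(phrase, s))
    return best if levenshtein(phrase, best) <= max_distance else phrase
```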
Disclosure of Invention
The embodiments of the application provide a text error correction method, system, computer device and computer-readable storage medium that combine the Chinese character features and the pinyin features of a text, train them with a Soft-Masked BERT pre-training model, and optimize the model through recursive prediction combined with vocabulary filtering, thereby improving error correction accuracy and improving the effect and performance of the model.
In a first aspect, an embodiment of the present application provides a text error correction method, including:
a data acquisition step, which is used for acquiring text data to be corrected;
a negative sample construction step, configured to create a confusion word table and perform corpus replacement on the text to be corrected according to the confusion word table to generate negative samples. Specifically, 15% of the characters in the text to be corrected are randomly replaced; of these replaced characters, 80% are replaced with homophonic characters from the confusion word table and the remaining 20% are replaced with random characters. Sample data constructed in this way is then used to train the model, so that the trained Soft-Masked BERT pre-training model acquires a stronger ability to correct homophone confusion errors;
And text error correction, namely using the text data to be corrected and the negative sample data as training data, splicing the Chinese character characteristics and the pinyin characteristics of the training data after being trained by a Soft-Masked BERT pre-training model respectively to obtain a training result, and calculating cross entropy loss of the training result through a Softmax layer to obtain an error correction result, wherein the Soft-Masked BERT pre-training model is a newly proposed neural network structure and comprises a detection network and a BERT-based correction network, and the Softmax layer is a function in machine learning and is used for mapping input to real numbers between 0 and 1, and the normalization ensures that the sum is 1.
In some embodiments, the text error correction method further includes a model optimization step, which is used for optimizing the Soft-Masked BERT pre-training model through recursive prediction and vocabulary filtering, so as to add constraints to model training and further improve error correction accuracy and model performance.
Through these steps, a Soft-Masked BERT pre-training model is utilized; specifically, the detection network in the Soft-Masked BERT pre-training model calculates the probability that each position is an error, and this probability is used as a character feature, which improves error correction accuracy. The pinyin features of the data are also used: the pinyin of the correct character is usually similar to that of the erroneous character, so the pinyin features effectively narrow the search range for the correct character. Recursive prediction reduces the misjudgment rate, and vocabulary filtering improves the effect and performance of the model.
In some of these embodiments, the text correction step further comprises:
a word vector obtaining step, which is used for respectively training the Chinese character characteristics and the pinyin characteristics through a Soft-Masked BERT pre-training model to obtain word vectors of Chinese characters and word vectors of pinyin;
and a cross entropy loss obtaining step, which is used for splicing the word vector of the Chinese character and the word vector of the pinyin, calculating the sum of cross entropy losses of all positions in the training data through a Softmax layer and outputting an error correction result.
In some of these embodiments, the model optimization step further comprises:
a recursive prediction step, configured to input the corrected sentence obtained in the text error correction step into the Soft-Masked BERT pre-training model again for recursive error correction; specifically, the recursive prediction step includes at least one recursion, and when two consecutive recursive predictions give the same result the recursion is stopped; and
a vocabulary filtering step, which filters the vocabulary in the Soft-Masked BERT pre-training model so that the number of search words used by the Soft-Masked BERT pre-training model during error correction is less than or equal to 1000, wherein the degree of the vocabulary filtering can be adjusted and, preferably, words such as articles, prepositions, adverbs and conjunctions are filtered out of the vocabulary.
In this embodiment, the Soft-Masked BERT pre-training model corrects one position at a time, that is, if a sentence has several error positions, only one position may be corrected in a single forward pass; therefore, the error correction accuracy of the whole sentence is improved through recursive prediction. Through the vocabulary filtering step, the effect and performance of the Soft-Masked BERT pre-training model are obviously improved.
In a second aspect, an embodiment of the present application provides a text correction system, including:
the data acquisition module is used for acquiring text data to be corrected;
the negative sample construction module, used for creating a confusion word table and performing corpus replacement on the text to be corrected according to the confusion word table to generate negative samples. Specifically, 15% of the characters in the text to be corrected are randomly replaced; of these replaced characters, 80% are replaced with homophonic characters from the confusion word table and the remaining 20% are replaced with random characters. A data set constructed in this way is then used to train the model, so that the trained Soft-Masked BERT pre-training model acquires a stronger ability to correct homophone confusion errors.
The text error correction module is used for taking the text data to be corrected and the negative sample data as training data, training the Chinese character features and the pinyin features of the training data respectively through a Soft-Masked BERT pre-training model, concatenating the two training outputs into a training result, and calculating cross-entropy loss over the training result through a Softmax layer to obtain the error correction result.
In some embodiments, the text error correction system further includes a model optimization module, which is used for optimizing the Soft-Masked BERT pre-training model through recursive prediction and vocabulary filtering, so as to further improve the error correction accuracy and the model performance.
Through these modules, a Soft-Masked BERT pre-training model is utilized; specifically, the detection network in the Soft-Masked BERT pre-training model calculates the probability that each position is an error, and this probability is used as a character feature, which improves error correction accuracy. The pinyin features of the data are also used: the pinyin of the correct character is usually similar to that of the erroneous character, so the pinyin features effectively narrow the search range for the correct character. Recursive prediction reduces the misjudgment rate, and vocabulary filtering improves the effect and performance of the model.
In some embodiments, the text correction module further comprises:
the word vector acquisition module is used for respectively carrying out Soft-Masked BERT pre-training model training on the Chinese character characteristics and the pinyin characteristics to obtain word vectors of Chinese characters and word vectors of pinyin;
and the cross entropy loss acquisition module is used for splicing the word vector of the Chinese character and the word vector of the pinyin, calculating the sum of cross entropy losses of all positions in the training data through a Softmax layer and outputting an error correction result.
In some of these embodiments, the model optimization module further comprises:
the recursive prediction module, used for inputting the corrected sentences obtained by the text error correction module into the Soft-Masked BERT pre-training model again for recursive error correction; specifically, when two consecutive recursive predictions give the same result, the recursive error correction is stopped; and
the vocabulary filtering module, used for filtering the vocabulary in the Soft-Masked BERT pre-training model so that the number of search words used by the Soft-Masked BERT pre-training model during error correction is less than or equal to 1000, wherein the filtering degree of the vocabulary filtering can be adjusted and, preferably, words such as articles, prepositions, adverbs and conjunctions are filtered out of the vocabulary.
In this embodiment, the Soft-Masked BERT pre-training model corrects one position at a time, that is, if a sentence has several error positions, only one position may be corrected in a single forward pass; therefore, the error correction accuracy of the whole sentence is improved through recursive prediction. Through vocabulary filtering, the effect and performance of the Soft-Masked BERT pre-training model are obviously improved.
In a third aspect, an embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the text error correction method according to the first aspect is implemented.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the text error correction method according to the first aspect.
Compared with the prior art, the text error correction method, the text error correction system, the computer equipment and the computer readable storage medium provided by the embodiment of the application improve the error correction effect by using the Soft-Masked BERT pre-training model compared with the original BERT model; in addition, the effect and the performance of the model are obviously improved and the correction accuracy is improved through recursive error correction and vocabulary filtering.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic flow chart diagram of a text error correction method according to an embodiment of the present application;
FIG. 2 is a block diagram illustrating the structure of a text correction system according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a text correction step according to a preferred embodiment of the present application;
FIG. 4 is a schematic diagram of the principle of the Soft-Masked BERT pre-training model according to the preferred embodiment of the present application.
Description of the drawings:
10. a data acquisition module; 20. a negative sample construction module; 30. a text error correction module;
40. a model optimization module;
301. a word vector acquisition module; 302. a cross entropy loss acquisition module;
401. a recursive prediction module; 402. and a word list filtering module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The present application is directed to the use of the terms "including," "comprising," "having," and any variations thereof, which are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Reference to "connected," "coupled," and the like in this application is not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as referred to herein means two or more. "and/or" describes an association relationship of associated objects, meaning that three relationships may exist, for example, "A and/or B" may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. Reference herein to the terms "first," "second," "third," and the like, are merely to distinguish similar objects and do not denote a particular ordering for the objects.
The embodiment provides a text error correction method. Fig. 1 is a schematic flow chart of a text error correction method according to an embodiment of the present application, and referring to fig. 1, the flow chart includes the following steps:
a data obtaining step S10, configured to obtain text data to be corrected;
A negative sample construction step S20, configured to create a confusion word table and perform corpus replacement on the text to be corrected according to the confusion word table to generate negative samples. Specifically, 15% of the characters in the text to be corrected are randomly replaced; of these replaced characters, 80% are replaced with homophonic characters from the confusion word table and the remaining 20% are replaced with random characters (an illustrative sketch of this construction is given after the overview below). Sample data constructed in this way is then used to train the model, so that the trained Soft-Masked BERT pre-training model acquires a stronger ability to correct homophone confusion errors.
A text error correction step S30, in which the text data to be corrected and the negative sample data are used as training data; the Chinese character features and the pinyin features of the training data are each trained by a Soft-Masked BERT pre-training model, the two outputs are concatenated into a training result, and cross-entropy loss is computed over the training result through a Softmax layer to obtain the error correction result.
And a model optimization step S40, which is used for optimizing the Soft-Masked BERT pre-training model through recursive prediction and vocabulary filtering, adding constraints for model training, and further improving the error correction accuracy and the performance of the model.
Through the above steps, a Soft-Masked BERT pre-training model is utilized; specifically, the detection network in the Soft-Masked BERT pre-training model calculates the probability that each position is an error, and this probability is used as a character feature, which improves error correction accuracy. The pinyin features of the data are also used: the pinyin of the correct character is usually similar to that of the erroneous character, so the pinyin features effectively narrow the search range for the correct character. Recursive prediction reduces the misjudgment rate, and vocabulary filtering improves the effect and performance of the model.
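As a non-limiting illustration of the negative sample construction step S20, the following minimal Python sketch applies the 15% / 80% / 20% replacement scheme described above; the confusion table format and the fallback character set are assumptions made for the example only:

```python
import random

def build_negative_sample(text: str,
                          confusion_table: dict[str, list[str]],
                          fallback_charset: str,
                          replace_ratio: float = 0.15,
                          homophone_ratio: float = 0.8) -> str:
    """Randomly replace about 15% of the characters; of those, about 80% are replaced
    with homophones from the confusion word table and the rest with random characters."""
    chars = list(text)
    n_replace = max(1, int(len(chars) * replace_ratio))
    for i in random.sample(range(len(chars)), min(n_replace, len(chars))):
        if random.random() < homophone_ratio and chars[i] in confusion_table:
            chars[i] = random.choice(confusion_table[chars[i]])  # homophone confusion error
        else:
            chars[i] = random.choice(fallback_charset)           # random replacement
    return "".join(chars)
```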
In some of these embodiments, the text correction step S30 further includes:
a word vector obtaining step S301, for respectively performing Soft-Masked BERT pre-training model training on the Chinese character features and the pinyin features to obtain word vectors of Chinese characters and word vectors of pinyin;
and a cross entropy loss obtaining step S302, which is used for splicing the word vector of the Chinese character and the word vector of the pinyin, calculating the sum of cross entropy losses of all positions in the training data through a Softmax layer, and outputting an error correction result.
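For illustration only, steps S301 and S302 can be sketched as below; this minimal PyTorch sketch assumes two encoder modules (one for the Chinese character features, one for the pinyin features) standing in for the Soft-Masked BERT pre-training model, and the tensor shapes are illustrative assumptions rather than the exact configuration of the embodiment:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualFeatureCorrector(nn.Module):
    """Concatenate character-level and pinyin-level word vectors and classify each
    position over the vocabulary with a Softmax / cross-entropy objective."""
    def __init__(self, char_encoder: nn.Module, pinyin_encoder: nn.Module,
                 hidden_dim: int, vocab_size: int):
        super().__init__()
        self.char_encoder = char_encoder      # stands in for the character Soft-Masked BERT
        self.pinyin_encoder = pinyin_encoder  # stands in for the pinyin Soft-Masked BERT
        self.output = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, char_ids, pinyin_ids, labels=None):
        h_char = self.char_encoder(char_ids)        # (batch, seq_len, hidden_dim)
        h_pinyin = self.pinyin_encoder(pinyin_ids)  # (batch, seq_len, hidden_dim)
        h = torch.cat([h_char, h_pinyin], dim=-1)   # concatenation of the two word vectors
        logits = self.output(h)                     # (batch, seq_len, vocab_size)
        if labels is None:
            return logits
        # sum of the cross-entropy losses over all positions in the training data
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                               labels.view(-1), reduction="sum")
        return logits, loss
```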
In some of these embodiments, the model optimization step S40 further includes:
a recursive prediction step S401, configured to input the sentence corrected in step S30 into the Soft-Masked BERT pre-training model again for recursive error correction; specifically, when two consecutive recursive predictions give the same result, the recursive error correction is stopped; and
a vocabulary filtering step S402, which filters the vocabulary in the Soft-Masked BERT pre-training model so that the number of search words used by the Soft-Masked BERT pre-training model during error correction is less than or equal to 1000, wherein the filtering degree of the vocabulary filtering can be adjusted and, preferably, words such as articles, prepositions, adverbs and conjunctions are filtered out of the vocabulary.
This embodiment takes into account that the Soft-Masked BERT pre-training model corrects one position at a time, that is, if a sentence has several error positions, only one position may be corrected in a single forward pass; therefore, the error correction accuracy of the whole sentence is improved through recursive prediction. Through the vocabulary filtering step, the effect and performance of the Soft-Masked BERT pre-training model are obviously improved.
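For illustration only, the recursive prediction of step S401 and the vocabulary filtering of step S402 can be sketched as below; `correct_once` and the function-word set are hypothetical placeholders for a single pass of the trained corrector and a stop-word list:

```python
def recursive_correct(sentence: str, correct_once, max_rounds: int = 5) -> str:
    """Feed the corrected sentence back into the model until two consecutive
    recursive predictions give the same result (or a round limit is reached)."""
    previous = sentence
    for _ in range(max_rounds):
        current = correct_once(previous)  # one pass of the Soft-Masked BERT corrector
        if current == previous:           # two consecutive predictions agree: stop
            return current
        previous = current
    return previous


def filter_vocabulary(vocab: list[str], function_words: set[str], limit: int = 1000) -> list[str]:
    """Drop articles, prepositions, adverbs, conjunctions and similar function words,
    and cap the number of candidate search words at 1000."""
    kept = [w for w in vocab if w not in function_words]
    return kept[:limit]
```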
The embodiments of the present application are described and illustrated below by means of preferred embodiments.
Fig. 3 is a schematic diagram of the principle of the text error correction step according to a preferred embodiment of the present application, and fig. 4 is a schematic diagram of the Soft-Masked BERT pre-training model according to the preferred embodiment of the present application. With reference to fig. 1, fig. 3 and fig. 4, in this embodiment the text data to be corrected and the negative sample data obtained through steps S10 and S20 are used as training data. The Chinese character features x1, x2, x3 and x4 of the training data are input to the Soft-Masked BERT pre-training model for character-level Soft-Masked BERT training to obtain word vectors of the Chinese characters, and the pinyin features x1', x2', x3' and x4' are likewise trained with pinyin-level Soft-Masked BERT to obtain pinyin word vectors; the two sets of word vectors are concatenated and passed through a Softmax layer to compute the cross-entropy loss.
The Soft-Masked BERT pre-training model shown in fig. 4 comprises a detection network (Detection Network) for predicting the probability that a character is a wrongly written character and a correction network (Correction Network) for predicting the correction. Specifically, the detection network is composed of a bidirectional GRU network (Bi-GRU for short); it fully learns the input context information and outputs, for each position i, the probability p_i that the character at that position may be wrongly written. The larger the value of p_i in the figure, the greater the probability that the position is an error, where i is a natural number. Between the detection network and the correction network lies the soft-masking part: the feature of each position is obtained by multiplying the feature of the masking character by the probability p_i, multiplying the original input word-vector feature by the probability 1 - p_i, and adding the two parts together as the feature of each character. The above process can be expressed as:
e_i' = p_i · e_mask + (1 - p_i) · e_i
where e_i' is the output feature of each character, e_mask is the feature of the masking character, e_i is the input word-vector feature, p_i is the probability that the i-th character is a wrongly written character, and i is a natural number.
The above e_i' is input into the correction network, which is a BERT-based sequence multi-class labeling model. The final feature representation of each character obtained by the correction network is the output of the last layer plus the word-vector feature e_i via a residual connection, and the loss function is a weighted sum of the detection network loss and the correction network loss.
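For illustration only, the detection network and the soft-masking computation e_i' = p_i · e_mask + (1 - p_i) · e_i described above can be sketched as below; this is a minimal PyTorch sketch, and the layer sizes are illustrative assumptions rather than the exact configuration of the embodiment:

```python
import torch
import torch.nn as nn

class DetectionNetwork(nn.Module):
    """Bi-GRU detection network: outputs, for each position i, the probability p_i
    that the character at that position is wrongly written."""
    def __init__(self, embed_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, 1)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, seq_len, embed_dim)
        hidden, _ = self.gru(embeddings)
        return torch.sigmoid(self.classifier(hidden))  # p: (batch, seq_len, 1)


def soft_mask(embeddings: torch.Tensor, mask_embedding: torch.Tensor,
              p: torch.Tensor) -> torch.Tensor:
    """Soft masking: e_i' = p_i * e_mask + (1 - p_i) * e_i."""
    # mask_embedding: (embed_dim,), the embedding of the masking character
    return p * mask_embedding + (1.0 - p) * embeddings
```

The soft-masked features would then be fed into the BERT-based correction network, and the training objective would weight the detection loss and the correction loss, as stated above.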
For example and without limitation, if the text data to be corrected is "machine learning is a known part of artificial intelligence", in which "known" is the wrongly written character, the sentence and its negative samples are input into the Soft-Masked BERT pre-training model. The detection network outputs, for each position, the probability that the character may be wrong, and the probability output at the position of "known" is higher. The soft-masking part of the Soft-Masked BERT pre-training model combines the error probability of each position with the masking character feature and passes the result to the correction network, which outputs the highest-probability character among the vocabulary candidates as the correct character; "known" is thereby corrected and text error correction is realized.
It should be noted that the steps shown in the above flow diagrams or in the flow diagrams of the figures may be performed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flow diagrams, in some cases the steps shown or described may be performed in an order different from that given here.
The embodiment also provides a text error correction system. Fig. 2 is a block diagram of a structure of a text correction system according to an embodiment of the present application. As shown in fig. 2, the text correction system includes: the system comprises a data acquisition module 10, a negative sample construction module 20, a text error correction module 30, a model optimization module 40 and the like. Those skilled in the art will appreciate that the configuration of the text correction system shown in FIG. 2 does not constitute a limitation of the text correction system, and may include more or fewer modules than shown, or some modules in combination, or a different arrangement of modules.
The following describes each constituent module of the text correction system in detail with reference to fig. 2:
the data acquisition module 10 is used for acquiring text data to be corrected;
the negative sample construction module 20 is configured to create a confusion word table and perform corpus replacement on the text to be corrected according to the confusion word table to generate a negative sample, specifically, 15% of the characters in the text to be corrected are randomly replaced, further, 80% of the characters in the 15% randomly replaced characters are replaced with homophonic characters in the confusion word table, and the rest 20% of the characters are replaced with random characters. The data set is constructed in the mode and then used for training the model, and the trained Soft-Masked BERT pre-training model can obtain stronger correcting capability of homophone confusion errors.
The text error correction module 30 is used for taking text data to be corrected and negative sample data as training data, respectively training the Chinese character characteristics and the pinyin characteristics of the training data through a Soft-Masked BERT pre-training model and then splicing the training results into training results, and calculating cross entropy loss of the training results through a Softmax layer to obtain error correction results;
and the model optimization module 40 is used for optimizing the Soft-Masked BERT pre-training model through recursive prediction and vocabulary filtering, and further improving the error correction accuracy and the model performance.
Wherein the text error correction module 30 further comprises: the word vector acquisition module 301, configured to train the Chinese character features and the pinyin features respectively with the Soft-Masked BERT pre-training model to obtain word vectors of the Chinese characters and word vectors of the pinyin; and the cross entropy loss acquisition module 302, configured to concatenate the word vectors of the Chinese characters and the word vectors of the pinyin, calculate the sum of cross-entropy losses over all positions in the training data through a Softmax layer, and output the error correction result. The model optimization module 40 further comprises: the recursive prediction module 401, configured to input the corrected sentence into the Soft-Masked BERT pre-training model again for recursive error correction, where, specifically, when two consecutive recursive predictions give the same result, the recursive error correction is stopped; and the vocabulary filtering module 402, configured to filter the vocabulary in the Soft-Masked BERT pre-training model so that the number of search words used by the Soft-Masked BERT pre-training model is less than or equal to 1000, wherein the filtering degree of the vocabulary filtering can be adjusted and, preferably, words such as articles, prepositions, adverbs and conjunctions are filtered out of the vocabulary.
In this embodiment, the Soft-Masked BERT pre-training model corrects one position at a time, that is, if a sentence has several error positions, only one position may be corrected in a single forward pass; therefore, the error correction accuracy of the whole sentence is improved through recursive prediction. Through vocabulary filtering, the effect and performance of the Soft-Masked BERT pre-training model are obviously improved.
In addition, the text error correction method described in conjunction with fig. 1 in the embodiment of the present application may be implemented by a computer device. The computer device may include a processor and a memory storing computer program instructions.
In particular, the processor may include a Central Processing Unit (CPU) or an Application-Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.
The memory may include, among other things, mass storage for data or instructions. By way of example, and not limitation, the memory may include a hard disk drive (HDD), a floppy disk drive, a solid state drive (SSD), flash memory, an optical disk, a magneto-optical disk, tape, a Universal Serial Bus (USB) drive, or a combination of two or more of these. The memory may include removable or non-removable (or fixed) media, where appropriate. The memory may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory is non-volatile memory. In particular embodiments, the memory includes read-only memory (ROM) and random access memory (RAM). The ROM may be mask-programmed ROM, programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), electrically rewritable ROM (EAROM), or flash memory (FLASH), or a combination of two or more of these, where appropriate. The RAM may be static random-access memory (SRAM) or dynamic random-access memory (DRAM), where the DRAM may be fast page mode DRAM (FPMDRAM), extended data output DRAM (EDODRAM), synchronous DRAM (SDRAM), and the like.
The memory may be used to store or cache various data files for processing and/or communication use, as well as possibly computer program instructions for execution by the processor.
The processor may read and execute the computer program instructions stored in the memory to implement any of the text error correction methods in the above embodiments.
In addition, in combination with the text error correction method in the foregoing embodiment, the embodiment of the present application may provide a computer-readable storage medium to implement. The computer readable storage medium having stored thereon computer program instructions; the computer program instructions, when executed by a processor, implement any of the text correction methods in the above embodiments.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A text error correction method, comprising:
a data acquisition step, which is used for acquiring text data to be corrected;
a negative sample construction step, which is used for creating a confusion word table and carrying out corpus replacement on the text to be corrected according to the confusion word table to generate a negative sample;
and a text error correction step of taking the text data to be corrected and the negative sample data as training data, training the Chinese character characteristics and the pinyin characteristics of the training data respectively through a Soft-Masked BERT pre-training model, concatenating the training outputs into a training result, and calculating cross-entropy loss through a Softmax layer to obtain an error correction result.
2. The text error correction method according to claim 1, further comprising a model optimization step for optimizing the Soft-Masked BERT pre-training model by recursive prediction and vocabulary filtering.
3. The text correction method of claim 1, wherein the text correction step further comprises:
a word vector obtaining step, which is used for respectively training the Chinese character characteristics and the pinyin characteristics through a Soft-Masked BERT pre-training model to obtain word vectors of Chinese characters and word vectors of pinyin;
and a cross entropy loss obtaining step, which is used for splicing the word vector of the Chinese character and the word vector of the pinyin, calculating the sum of cross entropy losses of all positions in the training data through a Softmax layer and outputting an error correction result.
4. The text correction method of claim 2 wherein the model optimization step further comprises:
a recursive prediction step, which is used for inputting the error-corrected sentences obtained in the text error correction step into the Soft-Masked BERT pre-training model again for recursive error correction;
and a word list filtering step, which is used for filtering the word list in the Soft-Masked BERT pre-training model, so that the number of search words in the Soft-Masked BERT pre-training model is less than or equal to 1000 when the error is corrected.
5. A text correction system, comprising:
the data acquisition module is used for acquiring text data to be corrected;
the negative sample construction module is used for creating a confusion word table and performing corpus replacement on the text to be corrected according to the confusion word table to generate a negative sample;
and the text error correction module is used for taking the text data to be corrected and the negative sample data as training data, training the Chinese character characteristics and the pinyin characteristics of the training data respectively through a Soft-Masked BERT pre-training model, concatenating the training outputs into a training result, and calculating the cross-entropy loss of the training result through a Softmax layer to obtain an error correction result.
6. The text correction system of claim 5, further comprising a model optimization module for optimizing the Soft-Masked BERT pre-training model by recursive prediction and vocabulary filtering.
7. The text correction system of claim 5, wherein the text correction module further comprises:
the word vector acquisition module is used for respectively carrying out Soft-Masked BERT pre-training model training on the Chinese character characteristics and the pinyin characteristics to obtain word vectors of Chinese characters and word vectors of pinyin;
and the cross entropy loss acquisition module is used for splicing the word vector of the Chinese character and the word vector of the pinyin, calculating the sum of cross entropy losses of all positions in the training data through a Softmax layer and outputting an error correction result.
8. The text correction system of claim 6, wherein the model optimization module further comprises:
the recursive prediction module is used for inputting the error-corrected sentences obtained by the text error correction module into the Soft-Masked BERT pre-training model again for recursive error correction;
and the vocabulary filtering module is used for filtering the vocabulary in the Soft-Masked BERT pre-training model so that the number of search words is less than or equal to 1000 when the Soft-Masked BERT pre-training model performs error correction.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the text correction method according to any one of claims 1 to 4 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a text correction method according to any one of claims 1 to 4.
CN202011293207.4A 2020-11-18 2020-11-18 Text error correction method, system, computer device and readable storage medium Pending CN112287670A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011293207.4A CN112287670A (en) 2020-11-18 2020-11-18 Text error correction method, system, computer device and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011293207.4A CN112287670A (en) 2020-11-18 2020-11-18 Text error correction method, system, computer device and readable storage medium

Publications (1)

Publication Number Publication Date
CN112287670A true CN112287670A (en) 2021-01-29

Family

ID=74398422

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011293207.4A Pending CN112287670A (en) 2020-11-18 2020-11-18 Text error correction method, system, computer device and readable storage medium

Country Status (1)

Country Link
CN (1) CN112287670A (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2015075706A (en) * 2013-10-10 2015-04-20 日本放送協会 Error correction model learning device and program
CN108874174A (en) * 2018-05-29 2018-11-23 腾讯科技(深圳)有限公司 A kind of text error correction method, device and relevant device
CN111523306A (en) * 2019-01-17 2020-08-11 阿里巴巴集团控股有限公司 Text error correction method, device and system
CN110046350A (en) * 2019-04-12 2019-07-23 百度在线网络技术(北京)有限公司 Grammatical bloopers recognition methods, device, computer equipment and storage medium
CN110457688A (en) * 2019-07-23 2019-11-15 广州视源电子科技股份有限公司 Correction processing method and device, storage medium and processor
CN110489760A (en) * 2019-09-17 2019-11-22 达而观信息科技(上海)有限公司 Based on deep neural network text auto-collation and device

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112560451B (en) * 2021-02-20 2021-05-14 京华信息科技股份有限公司 Wrongly written character proofreading method and device for automatically generating training data
CN112560451A (en) * 2021-02-20 2021-03-26 京华信息科技股份有限公司 Wrongly written character proofreading method and device for automatically generating training data
CN113011126B (en) * 2021-03-11 2023-06-30 腾讯科技(深圳)有限公司 Text processing method, text processing device, electronic equipment and computer readable storage medium
CN113011126A (en) * 2021-03-11 2021-06-22 腾讯科技(深圳)有限公司 Text processing method and device, electronic equipment and computer readable storage medium
CN113051894A (en) * 2021-03-16 2021-06-29 京东数字科技控股股份有限公司 Text error correction method and device
CN113065339A (en) * 2021-04-12 2021-07-02 平安国际智慧城市科技股份有限公司 Automatic error correction method, device and equipment for Chinese text and storage medium
CN113449514A (en) * 2021-06-21 2021-09-28 浙江康旭科技有限公司 Text error correction method and device suitable for specific vertical field
CN113449514B (en) * 2021-06-21 2023-10-31 浙江康旭科技有限公司 Text error correction method and device suitable for vertical field
CN114239559A (en) * 2021-11-15 2022-03-25 北京百度网讯科技有限公司 Method, apparatus, device and medium for generating text error correction and text error correction model
CN114239559B (en) * 2021-11-15 2023-07-11 北京百度网讯科技有限公司 Text error correction and text error correction model generation method, device, equipment and medium
WO2023093525A1 (en) * 2021-11-23 2023-06-01 中兴通讯股份有限公司 Model training method, chinese text error correction method, electronic device, and storage medium
CN114970502B (en) * 2021-12-29 2023-03-28 中科大数据研究院 Text error correction method applied to digital government
CN114970502A (en) * 2021-12-29 2022-08-30 中科大数据研究院 Text error correction method applied to digital government
CN114023306A (en) * 2022-01-04 2022-02-08 阿里云计算有限公司 Processing method for pre-training language model and spoken language understanding system
CN114676684A (en) * 2022-03-17 2022-06-28 平安科技(深圳)有限公司 Text error correction method and device, computer equipment and storage medium
CN114676684B (en) * 2022-03-17 2024-02-02 平安科技(深圳)有限公司 Text error correction method and device, computer equipment and storage medium
WO2023184633A1 (en) * 2022-03-31 2023-10-05 上海蜜度信息技术有限公司 Chinese spelling error correction method and system, storage medium, and terminal
WO2024045527A1 (en) * 2022-09-02 2024-03-07 美的集团(上海)有限公司 Word/sentence error correction method and device, readable storage medium, and computer program product
CN115455948A (en) * 2022-11-11 2022-12-09 北京澜舟科技有限公司 Spelling error correction model training method, spelling error correction method and storage medium
CN115630634A (en) * 2022-12-08 2023-01-20 深圳依时货拉拉科技有限公司 Text error correction method and device, electronic equipment and storage medium
CN115659958A (en) * 2022-12-27 2023-01-31 中南大学 Chinese spelling error checking method


Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination