CN113051894A - Text error correction method and device - Google Patents

Text error correction method and device

Info

Publication number
CN113051894A
CN113051894A (application number CN202110279919.9A)
Authority
CN
China
Prior art keywords
text
corrected
character
vector representation
error
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110279919.9A
Other languages
Chinese (zh)
Inventor
王培英
陈蒙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JD Digital Technology Holdings Co Ltd
Original Assignee
JD Digital Technology Holdings Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by JD Digital Technology Holdings Co Ltd filed Critical JD Digital Technology Holdings Co Ltd
Priority to CN202110279919.9A priority Critical patent/CN113051894A/en
Publication of CN113051894A publication Critical patent/CN113051894A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/232Orthographic correction, e.g. spell checking or vowelisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention discloses a text error correction method and device, relating to the field of computer technology. An embodiment of the method comprises the following steps: acquiring a text to be corrected and generating character pinyin for it; retrieving domain entries for the text to be corrected from a preset domain knowledge base; and inputting the text to be corrected, the character pinyin and the domain entries into a text correction model. The model is obtained by training on samples, each comprising an erroneous text and its corresponding correct text, with the training information input to the model comprising the erroneous text, the character pinyin of the erroneous text, and the domain entries of the erroneous text. The model then corrects the text to be corrected. This embodiment can improve both the accuracy and the efficiency of text error correction.

Description

Text error correction method and device
Technical Field
The invention relates to the technical field of computers, in particular to a text error correction method and device.
Background
Text error correction (i.e., the process of correcting errors in a text) is involved in many application scenarios, such as retrieval, text conversion, intention recognition, and intelligent customer service. Correcting errors early allows downstream processing such as lexical analysis and intention recognition to be performed accurately, so from the perspective of the natural language processing pipeline as a whole, text error correction plays a safeguarding role.
Currently, text error correction generally relies on a manually constructed dictionary of wrongly written characters for error matching and correction.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:
due to the limitations of such a dictionary, some rare proper nouns and similar terms may not be included in it, resulting in low accuracy and efficiency of text error correction.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for text error correction, which can effectively improve accuracy and efficiency of text error correction.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a text error correction method, including:
acquiring a text to be corrected, and generating character pinyin for the text to be corrected;
retrieving a domain entry for the text to be corrected in a preset domain knowledge base;
inputting the text to be corrected, the character pinyin and the field entries into a text correction model, wherein the text correction model is obtained by training a training sample, the training sample comprises an error text and a correct text corresponding to the error text, and the training information input for the text correction model comprises: the error text, the character pinyin of the error text and the domain entries of the error text;
and correcting the text to be corrected by using the text correction model, and outputting the corrected correct text.
Preferably, retrieving the domain entries for the text to be corrected includes:
dividing the text to be corrected into a plurality of character segments with preset lengths;
and retrieving a domain entry for the character segment in a preset domain knowledge base according to the character pinyin corresponding to the character segment.
Preferably, the text error correction method further includes:
respectively converting the text to be corrected, the character pinyin and the field entries into corresponding vector representations;
correcting the text to be corrected, including:
inputting the vector representation of the text to be corrected, the vector representation of the character pinyin and the vector representation of the field entry into the text correction model;
the text error correction model calculates the output probability distribution of the characters based on the vector representation of the text to be corrected, the vector representation of the character pinyin and the vector representation of the field entry;
and determining the characters included in the correct text according to the output probability distribution of the characters.
Preferably, calculating an output probability distribution of the character comprises:
respectively encoding the vector representation of the text to be corrected, the vector representation of the character pinyin and the vector representation of the field entry by using the encoder;
inputting the result of the encoding into a decoder comprised by the text correction model;
and the decoder calculates the output probability distribution of the characters according to the encoding result.
Preferably, the encoding, by the encoder, the vector representation of the text to be corrected, the vector representation of the character pinyin, and the vector representation of the field entry respectively includes:
the vector representation of the field entry is blended into the vector representation of the text to be corrected and the vector representation of the character pinyin;
and encoding the merged result.
Preferably, the text error correction method further includes:
determining a confusion set of each character in the text to be corrected, wherein the confusion set comprises a plurality of approximate characters;
based on the confusion set, the decoder performs the step of calculating the output probability distribution of the characters.
Preferably, calculating an output probability distribution of the character comprises:
calculating the copy probability of each character included in the text to be corrected based on the text to be corrected and the field entry;
calculating the generation probability of the characters included in the word list based on the confusion set corresponding to each character;
and calculating the output probability distribution of each character according to the generation probability of the character included in the word list and the copying probability of each character included in the text to be corrected.
Preferably, the text error correction method further includes:
constructing a confusion set indication matrix for the text to be corrected according to the confusion set corresponding to each character;
the output range in the generating mode can be limited in the confusion set through the confusion set indication matrix, and the step of calculating the generating probability of each character included in the word list is executed.
Preferably, the text error correction method further includes:
constructing a loss function by using the output probability of each training sample;
training model parameters by minimizing the value of the loss function to obtain the text error correction model.
In a second aspect, an embodiment of the present invention provides an apparatus for correcting text, including: a text processing module, a domain matching module, and a text correction module, wherein,
the text processing module is used for acquiring a text to be corrected and generating character pinyin for the text to be corrected;
the domain matching module is used for dividing the text to be corrected into a plurality of character segments with preset lengths, and retrieving domain entries for the character segments in a preset domain knowledge base according to character pinyin corresponding to the character segments;
the text error correction module is used for inputting the text to be corrected, the character pinyin and the field entries into a text error correction model, correcting the text to be corrected by using the text error correction model and outputting a corrected text; the text correction model is obtained by training a training sample, the training sample comprises an error text and a correct text corresponding to the error text, and the training information input for the text correction model comprises: the wrong text, the character pinyin of the wrong text and the domain entries of the wrong text.
One embodiment of the above invention has the following advantages or benefits: in the scheme provided by the application, character pinyin is generated for the text to be corrected, the text is divided into a plurality of character segments of a preset length, and domain entries are retrieved for the segments from a preset domain knowledge base according to the character pinyin corresponding to each segment; that is, character pinyin and domain entries are introduced for the text to be corrected. Introducing them adds features to the text to be corrected and narrows the range from which the text correction model copies or generates the correct text, thereby effectively improving both the accuracy and the efficiency of text correction.
Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of a main flow of a method of text error correction according to an embodiment of the invention;
FIG. 2 is a schematic diagram of the main structure of text correction according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a main flow of error correction of a text to be corrected according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a main flow of computing output probabilities of characters of a fusion domain entry according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a main flow of computing output probabilities of characters of a fusion domain entry based on a confusion set, according to an embodiment of the invention;
FIG. 6 is a schematic diagram of a main flow of computing output probabilities of characters of a fusion domain entry based on a confusion set, according to another embodiment of the present invention;
FIG. 7 is a schematic diagram of the main blocks of an apparatus for text correction according to an embodiment of the present invention;
FIG. 8 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
fig. 9 is a schematic structural diagram of a computer system suitable for implementing a terminal device or a server according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 shows a method for text error correction according to an embodiment of the present invention. As shown in fig. 1, the method may include the following steps:
step S101: acquiring a text to be corrected, and generating character pinyin for the text to be corrected;
because characters in the text usually have a plurality of homophones, the error correction direction of the text to be corrected can be better expanded through the character pinyin. For example, the text to be corrected is "the washing machine is a rainbow unit", and "zhe kuan xi ji shi xi hong yi ti de ma" can be generated through the step. Because the pinyin can correspond to various homophones, the character pinyin is introduced through the step, and more error correction features can be provided for the later.
Step S102: retrieving a domain entry for the text to be corrected in a preset domain knowledge base;
step S103: inputting a text to be corrected, character pinyin and field entries into a text correction model, wherein the text correction model is obtained by training a training sample, the training sample comprises an error text and a correct text corresponding to the error text, and training information input for the text correction model comprises: the method comprises the following steps of (1) obtaining a wrong text, a character pinyin of the wrong text and a field entry of the wrong text;
step S104: and correcting the text to be corrected by using the text correction model, and outputting the corrected correct text.
The text to be corrected may come from information entered by a user in a search box, information entered by a user on an intelligent question-answering page, primary text converted from speech, and the like. The primary text converted from speech is the text produced directly by existing speech-to-text technology; owing to the limitations of that technology, the converted primary text may contain character errors.
A domain knowledge base is a collection of specialist terms in a domain, such as a commodity attribute knowledge base (which lists information such as product model series and attribute nouns). The domain knowledge base may include conventional domain knowledge as well as niche, long-tail data (long-tail data being words that are particular to a certain domain yet uncommon and rarely seen), etc.
The text error correction model combines the model framework of the Neural Machine Translation (NMT) task with the Transformer architecture based on the self-attention mechanism, adopting an end-to-end text error correction model that incorporates domain knowledge. The NMT model framework mainly adopts an encoder-decoder sequence model, in which the encoder encodes the source language to obtain final feature vectors, and the decoder generates the target-language sequence, i.e., the corrected text, from that feature information. The present application improves text error correction efficiency by introducing the self-attention Transformer architecture into the encoder-decoder sequence model to determine whether each character of the corrected text comes from copying or from generation.
In the embodiment shown in fig. 1, character pinyin is generated for the text to be corrected, the text is divided into a plurality of character segments of a preset length, and domain entries are matched for the segments in a preset domain knowledge base according to the segments' character pinyin; that is, character pinyin and domain entries are introduced for the text to be corrected. On the one hand this adds features to the text to be corrected, and on the other it narrows the range from which the text correction model copies or generates the correct text, thereby effectively improving both the accuracy and the efficiency of text correction.
The specific implementation of step S102 may include: dividing the text to be corrected into a plurality of character segments of a preset length, and retrieving domain entries for the character segments from the preset domain knowledge base according to the character pinyin corresponding to each segment. A retrieved domain entry is one associated with the character segment. The preset length may be 2 characters or 3 characters, for example, and can be set according to user requirements. For example, the text to be corrected, "is this washing machine rainbow-integrated", is divided into character segments of length 3 such as "washing machine" (xi yi ji) and "rainbow one" (xi hong yi), while conventional function words in the text, such as "is" and the question particle, are filtered out. For the pinyin "xi hong yi" of the segment "rainbow one", the domain entry "washing and drying integrated" is matched in the domain knowledge base. That is, this process divides the text to be corrected into l character segments, and accordingly a domain entry set K = {k1, k2, …, kl} is obtained.
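The segmenting-and-retrieval step above can be sketched as follows; the PINYIN table and the KB knowledge base are hypothetical stand-ins for a real pinyin tool and a real domain knowledge base:

```python
# Sketch of step S102: split the text into fixed-length character
# segments (a sliding window) and look up domain entries by the
# segments' pinyin. PINYIN and KB are hypothetical stand-ins.
PINYIN = {"这": "zhe", "款": "kuan", "洗": "xi", "衣": "yi", "机": "ji",
          "是": "shi", "红": "hong", "一": "yi", "体": "ti", "的": "de", "吗": "ma"}
KB = {("xi", "hong", "yi"): "洗烘一体"}  # pinyin key -> domain entry (wash-dry integrated)

def char_segments(text: str, n: int = 3) -> list[str]:
    """All contiguous character segments of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def retrieve_domain_entries(text: str, n: int = 3) -> list[str]:
    entries = []
    for seg in char_segments(text, n):
        key = tuple(PINYIN.get(ch, ch) for ch in seg)
        if key in KB:
            entries.append(KB[key])
    return entries

print(retrieve_domain_entries("这款洗衣机是洗红一体的吗"))  # -> ['洗烘一体']
```

Keying the knowledge base by pinyin rather than by surface characters is what lets the erroneous segment still hit the correct entry.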
the main architecture of text correction by the text correction model can be as shown in fig. 2. As can be seen from fig. 2, in the solution provided by the embodiment of the present invention, the encoder encodes the text to be corrected (for example, "this washing machine is a rainbow-integrated machine"), the pinyin of the text to be corrected (for example, "zhe kuan xi yi ji shi xi hong yi ti ma") and the related domain entry (for example, "washing and drying integrated machine"), and the domain knowledge is merged into the code of the original text by a cross attention mechanism, the copying probability of each character in the text to be corrected is calculated (wherein cross attention means that the domain knowledge is merged into the code of the original text), and calculating the generation probability of each character in the text to be corrected through a decoder, obtaining the output probability of each character by using the copy probability and the generation probability of each character, and outputting the corrected correct text according to the output probability.
For the main process of the scheme provided by the embodiment of the present invention shown in fig. 2, the embodiments of the present invention respectively improve the processes of encoding, decoding, calculating the duplication probability, calculating the generation probability, calculating the output probability distribution, and the like.
In an embodiment of the present invention, the text error correction method may further include: respectively converting the text to be corrected, the character pinyin and the field entries into corresponding vector representations; accordingly, as shown in fig. 3, a specific embodiment of correcting the text to be corrected may include the following steps:
step S301: inputting the vector representation of the text to be corrected, the vector representation of the character pinyin and the vector representation of the field entry into a text correction model;
the vector representation of the text to be corrected, the vector representation of the character pinyin and the vector representation of the field entry can be realized by the conventional text vector conversion mode, for example, the text to be corrected and the character pinyin are converted into a vector representation set directly through an encoder
Figure BDA0002978373720000071
And setting the domain term K to { K ═ K1,k2,…,klConverting each field entry into vector representation to obtain a vector representation set of the field entries
Figure BDA0002978373720000072
Step S302: the text error correction model calculates the output probability distribution of the characters based on the vector representation of the text to be error corrected, the vector representation of the character pinyin and the vector of the field entry;
the vector representation of the character pinyin and the vector of the field entry are introduced in the process of calculating the output probability distribution of the characters, namely the characteristics of the text to be corrected are increased, so that the accuracy of calculating the output probability can be effectively improved.
Step S303: and determining the characters included in the correct text according to the output probability distribution of the characters.
In an embodiment of the present invention, as shown in fig. 4, the text error correction method may further include the following steps:
step S401: respectively coding vector representation of a text to be corrected, vector representation of character pinyin and vector representation of a field entry by using a coder;
the specific implementation manner of step S401 may include: the vector representation of the field entry is blended into the vector representation of the text to be corrected and the vector representation of the character pinyin; and encoding the merged result.
The encoding process can be implemented with the following calculation formulas (1) and (2), which yield character vector representations fused with the domain entries:

\alpha_{ij} = \mathrm{softmax}\left( \frac{(W_q x_i)^\top (W_k e_{ij})}{\sqrt{d}} \right) \quad (1)

h_i = \sum_j \alpha_{ij} \, (W_v e_{ij}) \quad (2)

where \alpha_{ij} is the attention weight fusing the i-th character of the text to be corrected with the j-th domain entry character; softmax() is the softmax function; W_q, W_k and W_v are parameter matrices trained with the encoder; d is the dimension of the key vectors W_k e_{ij}; x_i is the vector representation of the i-th character of the text to be corrected; e_{ij} is the vector representation of the j-th character in the domain entry retrieved for the i-th character; and h_i is the knowledge-fused character representation.
That is, the result of calculation formula (2) is the output of the encoder (the encoding corresponding to the knowledge-fused character representations).
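A minimal numerical sketch of calculation formulas (1) and (2), with random matrices standing in for the trained parameters W_q, W_k, W_v and for the input vectors:

```python
import numpy as np

# Numerical sketch of the cross-attention fusion of formulas (1)-(2):
# each character vector x_i attends over the domain-entry vectors, and
# the attention-weighted values give the fused representation h_i.
# All matrices are random stand-ins for trained parameters and inputs.
rng = np.random.default_rng(0)
d = 8                                    # vector dimension
n_chars, n_entry_chars = 5, 3
X = rng.normal(size=(n_chars, d))        # character vectors of the text
E = rng.normal(size=(n_entry_chars, d))  # vectors of retrieved entry characters
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

scores = (X @ Wq) @ (E @ Wk).T / np.sqrt(d)  # scaled dot products, formula (1)
alpha = softmax(scores, axis=-1)             # attention weights alpha_ij
H = alpha @ (E @ Wv)                         # fused representations h_i, formula (2)

assert H.shape == (n_chars, d)
assert np.allclose(alpha.sum(axis=-1), 1.0)
```

Each row of H is one character's representation after the domain knowledge has been blended in, which is what the decoder later attends over.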
Step S402: inputting the result of the encoding into a decoder comprised by the text error correction model;
step S403: and a step of calculating the output probability distribution of the character by the decoder according to the encoding result.
Because the field entries are introduced into the coding, the decoder can more accurately calculate the output probability distribution of the characters in the decoding process.
In an embodiment of the present invention, as shown in fig. 5, the text error correction method may further include the following steps:
step S501: determining a confusion set of each character in the text to be corrected, wherein the confusion set comprises a plurality of approximate characters;
the specific implementation manner of step S501 may be: the obfuscated sets of individual characters are stored in a database in advance, as shown in table 1 below. And searching the confusion set of each character in the text to be corrected in a searching mode.
TABLE 1

Character | Confusion set
Lifting of wine | Shengsheng nephew Sheng … Sheng rope-stay Shengsheng province
Article (Chinese character) | Weak mosquito smelling pattern … preventing mosquito from being asked
Heart with heart-shaped | Xinxinxinxin zinc core … dispute Xinxiong
… | …
The specific implementation of step S501 may also be: searching for approximate characters, near-sound characters and near-shape characters for each character in the text to be corrected, and combining the found characters into the confusion set of the corresponding character.
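A toy sketch of this second implementation; the NEAR_SOUND and NEAR_SHAPE tables are hypothetical stand-ins for the database behind Table 1:

```python
# Toy sketch of step S501: a character's confusion set as the union of
# its near-sound and near-shape characters. NEAR_SOUND / NEAR_SHAPE are
# hypothetical stand-ins for a real character-similarity database.
NEAR_SOUND = {"红": {"烘", "洪", "鸿"}}  # characters sharing the sound "hong"
NEAR_SHAPE = {"红": {"虹", "江"}}        # characters of similar written form

def confusion_set(ch: str) -> set[str]:
    # A character stays in its own confusion set, so the model may
    # also leave it unchanged.
    return {ch} | NEAR_SOUND.get(ch, set()) | NEAR_SHAPE.get(ch, set())

print("烘" in confusion_set("红"))  # -> True
```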
Step S502: based on the confusion set, a step of calculating output probabilities of characters in the text to be corrected is performed.
As shown in fig. 6, the step S502 described above performs a process of calculating an output probability of a character in a text to be corrected, and may include the following steps:
step S601: calculating the copying probability of each character included in the text to be corrected based on the text to be corrected and the field entries;
the calculation process of the step is calculated by the following calculation formula (3).
Figure BDA0002978373720000091
Wherein the content of the first and second substances,
Figure BDA0002978373720000092
representing the replication probability of the ith character in the text to be corrected at the t-th decoding moment (the replication probability refers to the probability of replicating the ith character in the text to be corrected from a source end (the text to be corrected and a domain entry);
Figure BDA0002978373720000093
Figure BDA0002978373720000094
wherein s istRepresenting a hidden state of decoding at the moment t; wqAnd WkRepresenting a parameter matrix obtained by training;
Figure BDA0002978373720000095
the representation of the original input text (text to be corrected) and the coded representation of the domain knowledge are represented.
Step S602: calculating the generation probability of the characters included in the word list based on the confusion set corresponding to each character;
the vocabulary refers to a vocabulary formed by characters in a confusion set corresponding to each character.
The specific implementation manner of step S602 may include: constructing a confusion set indication matrix for the text to be corrected according to the confusion set corresponding to each character; by the confusion set indication matrix, the output range in the generation mode can be limited in the confusion set, and the generation probability of each character in the word list is calculated.
The confusion set indication matrix is M \in \mathbb{R}^{g \times |V|}, where |V| is the number of characters in all confusion sets corresponding to the text to be corrected, and g is the length of the text to be corrected. Each element M_{if} of M takes the value 0 or 1, as given by calculation formula (4):

M_{if} = 1 if the f-th vocabulary character belongs to the confusion set of the i-th character, and M_{if} = 0 otherwise \quad (4)

Using the confusion set indication matrix M, the generation probability of each character is calculated by calculation formula (5), which limits the output range of the generation mode to the confusion set:

\tilde{P}^{gen}_t = \mathrm{normalize}\left( M_i \odot P^{gen}_t \right) \quad (5)

where \tilde{P}^{gen}_t is the generation probability, restricted to the confusion set, at the t-th step of error correction decoding; P^{gen}_t is the reference generation probability at step t given by the decoder; and M_i is the indication vector for the i-th character, i.e., the new indication matrix formed by the i-th row of the confusion set indication matrix.
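A toy sketch of calculation formulas (4) and (5): build the 0/1 indication matrix over a hypothetical vocabulary, mask the base generation distribution, and renormalise (the renormalisation is an assumption needed to keep a valid probability distribution):

```python
import numpy as np

# Sketch of formulas (4)-(5): the 0/1 confusion-set indication matrix M
# restricts generation to the confusion set of the current source
# character. Vocabulary and confusion sets are hypothetical toy data.
vocab = ["红", "烘", "洪", "虹", "吗"]
conf_sets = {0: {"红", "烘", "洪", "虹"}}      # confusion set of source char 0

g, V = 1, len(vocab)
M = np.zeros((g, V))                           # formula (4)
for i, cs in conf_sets.items():
    for f, ch in enumerate(vocab):
        M[i, f] = 1.0 if ch in cs else 0.0

p_gen = np.array([0.1, 0.5, 0.1, 0.1, 0.2])    # base generation distribution
masked = M[0] * p_gen                          # zero out characters outside the set
p_gen_masked = masked / masked.sum()           # formula (5), renormalised

assert p_gen_masked[vocab.index("吗")] == 0.0  # out-of-set character is excluded
assert np.isclose(p_gen_masked.sum(), 1.0)
```

The masking is what shrinks the decoder's search space: only characters confusable with the source character can be generated.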
Step S603: and calculating the output probability distribution of each character according to the generation probability of each character included in the word list and the copying probability of each character included in the text to be corrected.
Each of the characters is a character included in the vocabulary and a character included in the text to be corrected.
In this step, the output probability can be calculated by the following calculation formula (6):

P_t(i) = \beta \, P^{copy}_t(i) + (1 - \beta) \, \tilde{P}^{gen}_t(i) \quad (6)

where P_t(i) is the output probability of the i-th character at the t-th step of error correction decoding; \beta is the trained weight of the copy mode, which serves during decoding as a balance factor between generating from the confusion vocabulary and copying from the text to be corrected; \tilde{P}^{gen}_t(i) is the result of calculation formula (5); and P^{copy}_t(i) is the copy probability of the i-th character at the t-th decoding step, calculated by formula (3), i.e., the probability of copying from the source end (the text to be corrected and the domain entries).
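A minimal sketch of calculation formula (6); beta is fixed here for illustration, whereas in the model it is a trained weight, and both distributions are assumed to be expressed over the same candidate list:

```python
import numpy as np

# Sketch of formula (6): the final output distribution mixes the copy
# distribution and the confusion-set-restricted generation distribution
# with balance factor beta. Toy values; both vectors are assumed to be
# aligned over one shared candidate list.
beta = 0.7
p_copy = np.array([0.6, 0.3, 0.1])  # copy probabilities, formula (3)
p_gen = np.array([0.2, 0.7, 0.1])   # restricted generation probabilities, formula (5)

p_out = beta * p_copy + (1.0 - beta) * p_gen  # formula (6)

assert np.isclose(p_out.sum(), 1.0)  # a convex mix of distributions is a distribution
assert p_out.argmax() == 0
```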
In an embodiment of the present invention, the text error correction method may further include: constructing a loss function by using the output probability of each training sample; the model parameters are trained by minimizing the value of the loss function to obtain a text error correction model.
The loss function constructed in the above step is given by the following calculation formula (7):

loss = -\sum_{t=1}^{T} \log P_t(i'_t) \quad (7)

where loss is the loss value; P_t(i'_t) is the output probability of the reference character i'_t at the t-th position of the training text included in the training sample; and T is the total number of characters of that training text.
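A minimal sketch of calculation formula (7) on a two-character toy training sample:

```python
import numpy as np

# Sketch of formula (7): negative log-likelihood of the reference
# characters under the model's output distributions, summed over the
# T positions of a training sample. P and targets are toy data.
P = np.array([[0.7, 0.2, 0.1],    # output distribution at step t = 1
              [0.1, 0.8, 0.1]])   # output distribution at step t = 2
targets = [0, 1]                  # indices of the reference characters

loss = -sum(np.log(P[t, i]) for t, i in enumerate(targets))

assert loss > 0  # probabilities below 1 give a positive loss
```

Training then minimises this value over all samples, as described above, to obtain the text error correction model.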
In summary, the scheme provided by the embodiment of the invention integrates domain knowledge into the encoding process, improving the error detection and correction capability of the text error correction model. In addition, the pinyin features of the characters to be corrected are added during encoding; and during decoding, the confusion set constrains the generation range of the model, which reduces the search space and improves both prediction accuracy and computational efficiency.
As shown in fig. 7, an embodiment of the present invention provides an apparatus 700 for text error correction, where the apparatus 700 for text error correction may include: a text processing module 701, a domain matching module 702, and a text correction module 703, wherein,
the text processing module 701 is used for acquiring a text to be corrected and generating character pinyin for the text to be corrected;
the domain matching module 702 is configured to divide the text to be corrected into a plurality of character segments with preset lengths, and match domain entries for the character segments in a preset domain knowledge base according to the character pinyin corresponding to the character segments;
the text error correction module 703 is configured to input the text to be corrected, the character pinyin and the field entries into the text error correction model, correct the text to be corrected by using the text error correction model, and output the corrected correct text; the text error correction model is obtained by training with training samples, each training sample comprises an error text and the correct text corresponding to the error text, and the training information input into the text error correction model comprises: the error text, the character pinyin of the error text, and the domain entries of the error text.
In the embodiment of the present invention, the domain matching module 702 is configured to divide a text to be corrected into a plurality of character segments with preset lengths; and retrieving the domain entries for the character segments in a preset domain knowledge base according to the character pinyin corresponding to the character segments.
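The segment-then-match behavior of the domain matching module 702 can be sketched as follows: slide a window of the preset length over the text to be corrected and look each segment up in the domain knowledge base by its pinyin, so that a mistyped segment still hits the entry whose pronunciation matches. `pinyin_of`, the mock syllable table, and the pinyin-keyed knowledge base are illustrative stand-ins, not structures defined by the patent.

```python
def match_domain_entries(text, seg_len, pinyin_of, knowledge_base):
    """Slide a seg_len-character window over the text and retrieve domain
    entries whose pinyin key matches the segment's pinyin."""
    matches = []
    for start in range(len(text) - seg_len + 1):
        seg = text[start:start + seg_len]
        key = pinyin_of(seg)
        if key in knowledge_base:
            matches.append((seg, knowledge_base[key]))
    return matches

# Toy example with a fake pinyin table (real systems would use a pinyin library).
mock_pinyin = {"京": "jing", "东": "dong", "栋": "dong"}
pinyin_of = lambda s: " ".join(mock_pinyin.get(c, c) for c in s)
kb = {"jing dong": "京东"}          # domain entry indexed by its pinyin
hits = match_domain_entries("京栋物流", 2, pinyin_of, kb)
```

Here the misspelled segment "京栋" still retrieves the entry "京东" because the two share the same pinyin, which is exactly the retrieval behavior the module relies on.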
In the embodiment of the present invention, the text error correction module 703 is configured to convert the text to be corrected, the character pinyin, and the field entries into corresponding vector representations, respectively; input the vector representation of the text to be corrected, the vector representation of the character pinyin and the vector representation of the field entries into the text error correction model; the text error correction model calculates the output probability distribution of characters based on the vector representation of the text to be corrected, the vector representation of the character pinyin and the vector representation of the field entries; and the characters included in the correct text are determined according to the output probability distribution of the characters.
In the embodiment of the present invention, the text error correction module 703 is further configured to encode the vector representation of the text to be error corrected, the vector representation of the character pinyin, and the vector representation of the field entry by using an encoder, respectively; inputting the result of the encoding into a decoder comprised by the text error correction model; the decoder calculates the output probability of the character according to the result of the encoding.
In the embodiment of the present invention, the text error correction module 703 is configured to blend the vector representation of the domain entry into the vector representation of the text to be error corrected and the vector representation of the character pinyin; and encoding the merged result.
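One way to realize the described blending is to add the domain-entry vector element-wise into each character's text vector and pinyin vector before encoding. The patent does not fix the fusion operator, so the addition-plus-concatenation below is only an assumed sketch; gating or concatenation alone would also fit the description.

```python
import numpy as np

def blend(token_vecs, pinyin_vecs, entry_vec):
    """Assumed fusion: broadcast the domain-entry vector onto every position
    of the text and pinyin representations, then concatenate the two fused
    streams to form the encoder input."""
    fused_text = token_vecs + entry_vec              # (seq_len, dim)
    fused_pinyin = pinyin_vecs + entry_vec           # (seq_len, dim)
    return np.concatenate([fused_text, fused_pinyin], axis=-1)

tok = np.ones((5, 4))       # 5 characters, 4-dim text vectors
pin = np.zeros((5, 4))      # matching pinyin vectors
ent = np.full(4, 0.5)       # vector of one matched domain entry
enc_input = blend(tok, pin, ent)
```

The merged result `enc_input` is what the encoder would then consume, so the domain knowledge influences every encoded position.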
In this embodiment of the present invention, the text error correction module 703 is configured to determine a confusion set of each character in the text to be corrected, where the confusion set includes a plurality of approximate characters; based on the confusion set, the decoder performs the step of calculating the output probability distribution of the characters.
In the embodiment of the present invention, the text error correction module 703 is configured to calculate, based on the text to be error corrected and the field entries retrieved for the text to be error corrected, a copy probability of each character included in the text to be error corrected; calculating the generation probability of each character in the word list based on the confusion set corresponding to each character; and calculating the output probability distribution of each character according to the generation probability of each character in the word list and the copying probability of each character in the text to be corrected.
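The combination of generation and copy probabilities described here resembles a pointer-generator mix: the final output distribution is a gated sum of the generation distribution over the word list and the copy distribution scattered onto the vocabulary ids of the source characters. The gate value and the scatter-add form below are assumptions used for illustration, not the patent's exact formula.

```python
import numpy as np

def output_distribution(p_gen_vocab, p_copy_src, src_ids, gate):
    """Assumed pointer-generator-style mix of generation and copy modes.

    p_gen_vocab : generation probabilities over the vocabulary
    p_copy_src  : copy probabilities over source positions
    src_ids     : vocabulary id of each source character
    gate        : assumed mixing weight in [0, 1]"""
    out = gate * p_gen_vocab.copy()
    for pos, vid in enumerate(src_ids):
        out[vid] += (1.0 - gate) * p_copy_src[pos]   # scatter copy mass
    return out

vocab = 6
p_gen = np.full(vocab, 1.0 / vocab)                  # uniform generation dist
p_copy = np.array([0.7, 0.3])                        # two source characters
dist = output_distribution(p_gen, p_copy, src_ids=[2, 4], gate=0.5)
```

Because both inputs are probability distributions and the gate is convex, the mixed result is again a valid distribution, from which the output character can be chosen.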
In this embodiment of the present invention, the text error correction module 703 is further configured to construct a confusion set indication matrix for the text to be corrected according to the confusion set corresponding to each character; the output range of the generation mode is limited to the confusion set through the confusion set indication matrix, and the step of calculating the generation probability of each character in the word list is performed.
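The confusion set indication matrix can be read as a 0/1 mask of shape (sequence length, vocabulary size): for each source character it marks which vocabulary characters lie in its confusion set, and generation logits outside the set are masked out before the softmax, shrinking the search space. This masking formulation is one plausible reading of the limitation described above, not the patent's stated implementation.

```python
import numpy as np

def masked_generation_probs(logits, confusion_matrix):
    """Limit the generation range to each character's confusion set.

    confusion_matrix[i, v] == 1 iff vocabulary character v is in the
    confusion set of source character i; other logits become -inf so
    their probability after the softmax is exactly zero."""
    masked = np.where(confusion_matrix > 0, logits, -np.inf)
    masked -= masked.max(axis=-1, keepdims=True)     # numerical stability
    exp = np.exp(masked)
    return exp / exp.sum(axis=-1, keepdims=True)

logits = np.zeros((2, 5))                      # 2 source chars, vocab of 5
conf = np.array([[1, 1, 0, 0, 0],              # char 0: confusion set {0, 1}
                 [0, 0, 1, 1, 1]])             # char 1: confusion set {2, 3, 4}
probs = masked_generation_probs(logits, conf)
```

All probability mass stays inside each character's confusion set, so the generation mode can only propose the approximate characters the indication matrix allows.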
In this embodiment of the present invention, the text error correction module 703 is further configured to construct a loss function by using the output probability of each training sample; the model parameters are trained by minimizing the value of the loss function to obtain a text error correction model.
The text error correction device can be installed on the client side in the form of a plug-in, and can also be deployed on a server that communicates with the client.
Fig. 8 shows an exemplary system architecture 800 of a text error correction method or apparatus to which embodiments of the invention may be applied.
As shown in fig. 8, the system architecture 800 may include terminal devices 801, 802, 803, a network 804, and a server 805. The network 804 serves as a medium for providing communication links between the terminal devices 801, 802, 803 and the server 805. The network 804 may include various connection types, such as wired links, wireless communication links, or fiber optic cables.
A user may use the terminal devices 801, 802, 803 to interact with a server 805 over a network 804 to receive or send messages or the like. The terminal devices 801, 802, 803 may have installed thereon various communication client applications, such as shopping-like applications, web browser applications, search-like applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).
The terminal devices 801, 802, 803 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 805 may be a server providing various services, for example, the server 805 encapsulates the trained text error correction model into a text error correction device or a plug-in, and may publish the text error correction device or the plug-in to various communication client applications installed in the terminal devices 801, 802, 803 via the network 804. The server 805 may further encapsulate the trained text correction model into a text correction device and operate the text correction device.
For the case that the server 805 encapsulates the trained text error correction model into a text error correction device or a plug-in and issues it through the network 804 to the communication client applications installed on the terminal devices 801, 802, and 803: when a communication client application on the terminal devices 801, 802, and 803 receives externally input text, the text is taken as the text to be corrected and is subjected to error correction processing by the text error correction device or plug-in.
For the case that the server 805 encapsulates the trained text error correction model into a text error correction device and runs the device itself: text input by a user through a communication client application on the terminal devices 801, 802, and 803 is obtained and taken as the text to be corrected; the text error correction device performs error correction processing on it and outputs the corrected correct text to the communication client application on the terminal devices 801, 802, and 803.
It should be noted that the method for text error correction provided by the embodiment of the present invention may be executed by the terminal devices 801, 802, and 803 or the server 805, and accordingly, the apparatus for text error correction may be disposed in the terminal devices 801, 802, and 803 or the server 805.
It should be understood that the number of terminal devices, networks, and servers in fig. 8 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 9, shown is a block diagram of a computer system 900 suitable for use with a terminal device implementing an embodiment of the present invention. The terminal device shown in fig. 9 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 9, the computer system 900 includes a Central Processing Unit (CPU)901 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)902 or a program loaded from a storage section 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data necessary for the operation of the system 900 are also stored. The CPU 901, ROM 902, and RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to bus 904.
The following components are connected to the I/O interface 905: an input portion 906 including a keyboard, a mouse, and the like; an output section 907 including components such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the internet. The drive 910 is also connected to the I/O interface 905 as necessary. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 910 as necessary, so that a computer program read out therefrom is mounted into the storage section 908 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 909, and/or installed from the removable medium 911. The above-described functions defined in the system of the present invention are executed when the computer program is executed by a Central Processing Unit (CPU) 901.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes a text processing module, a domain matching module, and a text error correction module. In some cases, the names of these modules do not constitute a limitation on the modules themselves; for example, the text processing module may also be described as a "module for acquiring a text to be corrected and generating character pinyin for the text to be corrected".
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be separate and not incorporated into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to comprise: acquiring a text to be corrected, and generating character pinyin for the text to be corrected; retrieving a domain entry for the text to be corrected in a preset domain knowledge base; inputting a text to be corrected, character pinyin and field entries into a text correction model, wherein the text correction model is obtained by training a training sample, the training sample comprises an error text and a correct text corresponding to the error text, and training information input for the text correction model comprises: the method comprises the following steps of (1) obtaining a wrong text, a character pinyin of the wrong text and a field entry of the wrong text; and correcting the text to be corrected by using the text correction model, and outputting the corrected correct text.
According to the technical scheme provided by the embodiment of the invention, character pinyin is generated for the text to be corrected, the text to be corrected is divided into a plurality of character segments of preset length, and domain entries are matched for the character segments in the preset domain knowledge base according to the character pinyin corresponding to the character segments; that is, the character pinyin and the domain entries are introduced as additional information for correcting the text to be corrected.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (12)

1. A method of text correction, comprising:
acquiring a text to be corrected, and generating character pinyin for the text to be corrected;
retrieving a domain entry for the text to be corrected in a preset domain knowledge base;
inputting the text to be corrected, the character pinyin and the field entries into a text correction model, wherein the text correction model is obtained by training a training sample, the training sample comprises an error text and a correct text corresponding to the error text, and the training information input for the text correction model comprises: the error text, the character pinyin of the error text and the domain entries of the error text;
and correcting the text to be corrected by using the text correction model, and outputting the corrected correct text.
2. The method of claim 1, wherein retrieving a domain entry for the text to be corrected comprises:
dividing the text to be corrected into a plurality of character segments with preset lengths;
and retrieving a domain entry for the character segment in a preset domain knowledge base according to the character pinyin corresponding to the character segment.
3. The method of claim 1, further comprising:
respectively converting the text to be corrected, the character pinyin and the field entries into corresponding vector representations;
correcting the text to be corrected, including:
inputting the vector representation of the text to be corrected, the vector representation of the character pinyin and the vector representation of the field entry into the text correction model;
the text error correction model calculates the output probability distribution of characters based on the vector representation of the text to be corrected, the vector representation of the character pinyin and the vector representation of the field entry;
and determining the characters included in the correct text according to the output probability distribution of the characters.
4. The method of claim 3, wherein computing the output probability distribution for the character comprises:
respectively encoding the vector representation of the text to be corrected, the vector representation of the character pinyin and the vector representation of the field entry by using the encoder;
inputting the result of the encoding into a decoder comprised by the text correction model;
and the decoder calculates the output probability distribution of the characters according to the encoding result.
5. The method of claim 4, wherein encoding the vector representation of the text to be corrected, the vector representation of the character pinyin and the vector representation of the domain entry by the encoder respectively comprises:
the vector representation of the field entry is blended into the vector representation of the text to be corrected and the vector representation of the character pinyin;
and encoding the merged result.
6. The method of claim 4, further comprising:
determining a confusion set of each character in the text to be corrected, wherein the confusion set comprises a plurality of approximate characters;
based on the confusion set, the decoder performs the step of calculating an output probability distribution for the character.
7. The method of claim 6, wherein computing the output probability distribution for the character comprises:
calculating the copy probability of each character included in the text to be corrected based on the text to be corrected and the field entry retrieved for the text to be corrected;
calculating the generation probability of the characters included in the word list based on the confusion set corresponding to each character;
and calculating the output probability of each character according to the generation probability of the character included in the word list and the copying probability of each character included in the text to be corrected.
8. The method of claim 7, further comprising:
constructing a confusion set indication matrix for the text to be corrected according to the confusion set corresponding to each character;
and limiting the output range in the generation mode to the confusion set through the confusion set indication matrix, and executing the step of calculating the generation probability of each character included in the word list.
9. The method of claim 6, further comprising:
constructing a loss function by using the output probability of each training sample;
training model parameters by minimizing the value of the loss function to obtain the text error correction model.
10. An apparatus for correcting text, comprising: a text processing module, a domain matching module, and a text correction module, wherein,
the text processing module is used for acquiring a text to be corrected and generating character pinyin for the text to be corrected;
the domain matching module is used for dividing the text to be corrected into a plurality of character segments with preset lengths and matching domain entries for the character segments in a preset domain knowledge base according to the character pinyin corresponding to the character segments;
the text error correction module is used for inputting the text to be corrected, the character pinyin and the field entries into a text error correction model, correcting the text to be corrected by using the text error correction model and outputting the corrected correct text; the text error correction model is obtained by training with training samples, the training sample comprises an error text and a correct text corresponding to the error text, and the training information input for the text error correction model comprises: the wrong text, the character pinyin of the wrong text and the domain entries of the wrong text.
11. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-9.
12. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-9.
CN202110279919.9A 2021-03-16 2021-03-16 Text error correction method and device Pending CN113051894A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110279919.9A CN113051894A (en) 2021-03-16 2021-03-16 Text error correction method and device


Publications (1)

Publication Number Publication Date
CN113051894A true CN113051894A (en) 2021-06-29

Family

ID=76512806

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110279919.9A Pending CN113051894A (en) 2021-03-16 2021-03-16 Text error correction method and device

Country Status (1)

Country Link
CN (1) CN113051894A (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014036827A1 (en) * 2012-09-10 2014-03-13 华为技术有限公司 Text correcting method and user equipment
US20180060302A1 (en) * 2016-08-24 2018-03-01 Microsoft Technology Licensing, Llc Characteristic-pattern analysis of text
CN111428494A (en) * 2020-03-11 2020-07-17 中国平安人寿保险股份有限公司 Intelligent error correction method, device and equipment for proper nouns and storage medium
CN111523306A (en) * 2019-01-17 2020-08-11 阿里巴巴集团控股有限公司 Text error correction method, device and system
CN111626048A (en) * 2020-05-22 2020-09-04 腾讯科技(深圳)有限公司 Text error correction method, device, equipment and storage medium
CN112287670A (en) * 2020-11-18 2021-01-29 北京明略软件系统有限公司 Text error correction method, system, computer device and readable storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG Quanbin, TAN Ying: "Chinese Grammatical Error Correction Method Based on Data Augmentation and Copying", CAAI Transactions on Intelligent Systems, vol. 15, no. 1, pages 99-105 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114239559A (en) * 2021-11-15 2022-03-25 北京百度网讯科技有限公司 Method, apparatus, device and medium for generating text error correction and text error correction model
CN114239559B (en) * 2021-11-15 2023-07-11 北京百度网讯科技有限公司 Text error correction and text error correction model generation method, device, equipment and medium
CN116757184A (en) * 2023-08-18 2023-09-15 昆明理工大学 Vietnam voice recognition text error correction method and system integrating pronunciation characteristics
CN116757184B (en) * 2023-08-18 2023-10-20 昆明理工大学 Vietnam voice recognition text error correction method and system integrating pronunciation characteristics


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 221, 2 / F, block C, 18 Kechuang 11th Street, Beijing Economic and Technological Development Zone, 100176

Applicant after: Jingdong Technology Holding Co.,Ltd.

Address before: Room 221, 2 / F, block C, 18 Kechuang 11th Street, Beijing Economic and Technological Development Zone, 100176

Applicant before: Jingdong Digital Technology Holding Co., Ltd