CN113051894B - Text error correction method and device - Google Patents
- Publication number
- CN113051894B (application CN202110279919.9A)
- Authority
- CN
- China
- Prior art keywords
- text
- corrected
- character
- pinyin
- domain
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/232—Orthographic correction, e.g. spell checking or vowelisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The invention discloses a text error correction method and device, relating to the technical field of computers. A specific implementation of the method comprises the following steps: acquiring a text to be corrected and generating character pinyin for it; searching a preset domain knowledge base for domain entries for the text to be corrected; inputting the text to be corrected, the character pinyin, and the domain entries into a text correction model, wherein the model is trained on training samples comprising error texts and their corresponding correct texts, and the training input to the model comprises the error text, the pinyin of its characters, and its domain entries; and correcting the text to be corrected with the text correction model. The embodiment can improve the accuracy and efficiency of text error correction.
Description
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and apparatus for text error correction.
Background
Many application scenarios, such as search, text conversion, intention recognition, and intelligent customer service, involve text error correction (the process of correcting errors in text). Accurate correction lets downstream processing, such as lexical analysis and intention recognition, operate on clean text, so text error correction acts as a safeguard for the overall natural language processing pipeline.
Text correction currently relies mainly on manually constructed dictionaries of wrongly written characters for error matching and correction.
In the process of implementing the present invention, the inventor found at least the following problem in the prior art:
because of the limitations of such a dictionary, some relatively rare proper nouns and the like may not be included in it, resulting in low accuracy and efficiency of text correction.
Disclosure of Invention
In view of this, the embodiments of the present invention provide a method and apparatus for text error correction, which can effectively improve the accuracy and efficiency of text error correction.
To achieve the above object, according to one aspect of the embodiments of the present invention, there is provided a text error correction method, including:
Acquiring a text to be corrected, and generating character pinyin for the text to be corrected;
searching a domain entry for the text to be corrected in a preset domain knowledge base;
Inputting the text to be corrected, the character pinyin and the field entry into a text correction model, wherein the text correction model is trained by a training sample, the training sample comprises an error text and a correct text corresponding to the error text, and training information input for the text correction model comprises: the error text, the pinyin of the characters of the error text and the domain entry of the error text;
and correcting the text to be corrected by using the text correction model, and outputting corrected correct text.
Preferably, searching the domain entry for the text to be corrected includes:
dividing the text to be corrected into a plurality of character fragments with preset lengths;
And searching the domain entry for the character segment in a preset domain knowledge base according to the character pinyin corresponding to the character segment.
Preferably, the text error correction method further includes:
respectively converting the text to be corrected, the character pinyin and the field entry into corresponding vector representations;
correcting the text to be corrected, including:
Inputting the vector representation of the text to be corrected, the vector representation of the character pinyin and the vector representation of the field entry into the text correction model;
the text error correction model calculates the output probability distribution of characters based on the vector representation of the text to be corrected, the vector representation of the pinyin of the characters and the vector of the domain entry;
and determining the characters included in the correct text according to the output probability distribution of the characters.
Preferably, calculating the output probability distribution of the character includes:
Encoding the vector representation of the text to be corrected, the vector representation of the pinyin of the character and the vector representation of the domain entry by using the encoder respectively;
inputting the encoded result into a decoder included in the text error correction model;
and the decoder calculates the output probability distribution of the character according to the coding result.
Preferably, encoding the vector representation of the text to be corrected, the vector representation of the character pinyin, and the vector representation of the domain entry respectively includes:
merging the vector representation of the domain entry into the vector representation of the text to be corrected and the vector representation of the character pinyin;
And encoding the merged result.
Preferably, the text error correction method further includes:
Determining a confusion set of each character in the text to be corrected, wherein the confusion set comprises a plurality of approximate characters;
based on the confusion set, the decoder performs the step of calculating an output probability distribution for the character.
Preferably, calculating the output probability distribution of the character includes:
calculating the copy probability of each character included in the text to be corrected based on the text to be corrected and the field entry;
calculating the generation probability of the characters included in the word list based on the confusion set corresponding to each character;
And calculating the output probability distribution of each character according to the generation probability of the character included in the vocabulary and the replication probability of each character included in the text to be corrected.
Preferably, the text error correction method further includes:
Constructing a confusion set indication matrix for the text to be corrected according to the confusion set corresponding to each character;
calculating the generation probability of each character included in the vocabulary, with the confusion set indication matrix limiting the output range of the generation mode to the confusion set.
Preferably, the text error correction method further includes:
constructing a loss function by using the output probability of each training sample;
model parameters are trained by minimizing the value of the loss function to obtain the text error correction model.
In a second aspect, an embodiment of the present invention provides a text error correction apparatus, including: a text processing module, a field matching module and a text error correction module, wherein,
The text processing module is used for acquiring the text to be corrected and generating character pinyin for the text to be corrected;
The domain matching module is used for dividing the text to be corrected into a plurality of character segments with preset lengths, and searching domain entries for the character segments in a preset domain knowledge base according to the pinyin of the characters corresponding to the character segments;
The text correction module is used for inputting the text to be corrected, the character pinyin and the field entry into a text correction model, correcting the text to be corrected by using the text correction model, and outputting corrected correct text; the text error correction model is obtained by training a training sample, the training sample comprises an error text and a correct text corresponding to the error text, and training information input for the text error correction model comprises: the error text, the pinyin of the characters of the error text, and the domain term of the error text.
One embodiment of the above application has the following advantage or benefit: character pinyin is generated for the text to be corrected, the text is divided into character segments of preset length, and domain entries are retrieved for those segments from a preset domain knowledge base according to the segments' pinyin. Introducing character pinyin and domain entries, on the one hand, adds features to the text to be corrected and, on the other hand, narrows the range from which the text correction model copies or generates correct text, thereby effectively improving both the accuracy and the efficiency of text correction.
Further effects of the above alternative implementations are described below in connection with specific embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of the main flow of a method of text correction according to an embodiment of the invention;
FIG. 2 is a schematic diagram of the main structure of text error correction according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the main flow of error correction for text to be error corrected according to an embodiment of the invention;
FIG. 4 is a schematic diagram of the main flow of calculating the output probabilities of characters with fused domain entries according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the main flow of calculating the output probabilities of characters based on confusion sets, according to an embodiment of the invention;
FIG. 6 is a schematic diagram of the main flow of calculating the output probabilities of characters based on confusion sets, according to another embodiment of the invention;
FIG. 7 is a schematic diagram of the main modules of an apparatus for text error correction according to an embodiment of the present invention;
FIG. 8 is an exemplary system architecture diagram in which embodiments of the present invention may be applied;
fig. 9 is a schematic diagram of a computer system suitable for use in implementing an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present invention are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a method of text correction according to an embodiment of the present invention, as shown in fig. 1, the method of text correction may include the steps of:
Step S101: acquiring a text to be corrected, and generating character pinyin for the text to be corrected;
Since characters in a text generally have many homophones, character pinyin widens the candidate correction directions for the text to be corrected. For example, for a text to be corrected of the form "is this washing machine a western-rainbow all-in-one" (where "western rainbow", pinyin "xi hong", is a homophone error for "wash-bake"), this step generates the pinyin sequence "zhe kuan xi yi ji shi xi hong yi ti de ma". Because one pinyin sequence can correspond to many homophones, introducing character pinyin provides additional error correction features for later steps.
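The pinyin-generation step can be sketched as a per-character table lookup. The mini character-to-pinyin table and the `to_pinyin` helper below are hypothetical stand-ins for a full pinyin dictionary (tone marks omitted); a real system would use a complete dictionary such as the pypinyin library:

```python
# Hypothetical mini character->pinyin table; a real system would use a
# complete pinyin dictionary covering all CJK characters.
PINYIN = {
    "这": "zhe", "款": "kuan", "洗": "xi", "衣": "yi", "机": "ji",
    "是": "shi", "虹": "hong", "一": "yi", "体": "ti", "的": "de", "吗": "ma",
}

def to_pinyin(text):
    # Characters missing from the table fall back to themselves.
    return [PINYIN.get(ch, ch) for ch in text]

print(" ".join(to_pinyin("这款洗衣机")))  # prints "zhe kuan xi yi ji"
```

Each character maps to one syllable here; handling heteronyms (characters with several readings) would require context, which the lookup sketch deliberately ignores.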
Step S102: searching a domain entry for a text to be corrected in a preset domain knowledge base;
Step S103: inputting a text to be corrected, a character pinyin and an domain entry into a text correction model, wherein the text correction model is trained by training samples, the training samples comprise error texts and correct texts corresponding to the error texts, and training information input into the text correction model comprises: error text, pinyin for characters of the error text, and domain entries for the error text;
Step S104: and correcting the text to be corrected by using the text correction model, and outputting corrected correct text.
The text to be corrected may come from text entered in a search box, text entered by the user on an intelligent question-answering page, raw text produced by speech-to-text conversion, and the like. Raw speech-to-text output is produced directly by existing conversion technology, and owing to the limitations of that technology the converted text may contain errors.
The domain knowledge base is a collection of professional vocabulary in a domain, for example an attribute knowledge base of commodities (listing information such as model series and attribute nouns of the commodities). The domain knowledge base may include general domain knowledge as well as long-tail data (unique, very useful, rare vocabulary of a particular domain), and the like.
The text correction model combines the model framework of Neural Machine Translation (NMT) tasks with the self-attention Transformer framework, adopting an end-to-end text error correction model that incorporates domain knowledge. The NMT-style framework is a sequence model with an encoder and a decoder: the encoder encodes the source sequence into feature vectors, and the decoder generates the target sequence, i.e. the corrected text, from those feature vectors. The application introduces a self-attention Transformer into this encoder-decoder sequence model to determine whether each character in the corrected text originates from copying or from generation, thereby improving text error correction efficiency.
In the embodiment shown in fig. 1, character pinyin is generated for the text to be corrected, the text is divided into character segments of preset length, and domain entries are matched for the segments in a preset domain knowledge base according to the segments' pinyin. In other words, character pinyin and domain entries are introduced for the text to be corrected; this adds features to the text on the one hand and narrows the copy/generation range of the text correction model on the other, thereby effectively improving the accuracy and efficiency of text correction.
The specific embodiment of step S102 may include: dividing the text to be corrected into a plurality of character segments of a preset length; and searching the domain entry for each character segment in the preset domain knowledge base according to the pinyin corresponding to the segment. The retrieved domain entry is one associated with the character segment. The preset length may be, for example, 2 characters or 3 characters, and can be set according to user requirements. For example, the text to be corrected "is this washing machine a western-rainbow all-in-one" is divided into segments of 3 characters, such as "washing machine" and "western-rainbow one", while common function words ("is", "the", etc.) are filtered out. The pinyin "xi hong yi" of the segment "western-rainbow one" then matches the domain entry "wash-bake all-in-one" in the domain knowledge base. Dividing the text to be corrected into l character segments in this way yields a domain entry set:

K = {k_1, k_2, …, k_l}.
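Step S102 can be sketched as a sliding-window split followed by a pinyin-keyed lookup. The contents of `DOMAIN_KB` below are hypothetical toy data, and the pinyin function is passed in so the sketch stays self-contained:

```python
# Toy domain knowledge base keyed by segment pinyin (hypothetical data).
DOMAIN_KB = {"xi hong yi": "wash-bake all-in-one"}

def segments(text, n):
    # All contiguous character segments of length n (sliding window).
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def lookup_domain_entries(text, n, to_pinyin):
    # to_pinyin maps a segment to its space-joined pinyin string,
    # which serves as the lookup key into the domain knowledge base.
    entries = []
    for seg in segments(text, n):
        entry = DOMAIN_KB.get(to_pinyin(seg))
        if entry is not None:
            entries.append(entry)
    return entries
```

For a 3-character window, `segments("abcde", 3)` yields `["abc", "bcd", "cde"]`; each segment's pinyin is then used as the key into the knowledge base, and the matched entries form the set K.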
The main architecture of the text correction model is shown in fig. 2. As fig. 2 shows, in the scheme provided by the embodiment of the present invention, the text to be corrected (for example, "is this washing machine a western-rainbow all-in-one"), its character pinyin (for example, "zhe kuan xi yi ji shi xi hong yi ti de ma"), and the related domain entries (for example, "wash-bake all-in-one") are encoded by an encoder. Domain knowledge is fused into the encoding of the original text by means of a cross-attention mechanism, and the copy probability of each character in the text to be corrected is calculated. The decoder calculates the generation probability of each character; the copy probability and the generation probability are combined into the output probability of each character, and the corrected correct text is output according to the output probabilities.
For the main process of the scheme shown in fig. 2, the various embodiments of the invention respectively improve the encoding process, the decoding process, the calculation of the copy probability, the calculation of the generation probability, and the calculation of the output probability distribution.
In an embodiment of the present invention, the text error correction method may further include: respectively converting the text to be corrected, the character pinyin and the field entry into corresponding vector representations; accordingly, as shown in fig. 3, the specific embodiment for correcting the text to be corrected may include the following steps:
Step S301: inputting the vector representation of the text to be corrected, the vector representation of the character pinyin and the vector representation of the field entry into a text correction model;
the vector representation of the text to be corrected, the vector representation of the character pinyin, and the vector representation of the domain entry can be obtained with existing text-to-vector methods, for example by directly converting the text to be corrected together with its character pinyin into a set of vector representations H^x = {h_1^x, h_2^x, …, h_g^x} with an encoder, and converting each entry of the domain entry set K = {k_1, k_2, …, k_l} into a vector representation, obtaining the set H^k = {h_1^k, h_2^k, …, h_l^k}.
Step S302: the text error correction model calculates the output probability distribution of characters based on the vector representation of the text to be corrected, the vector representation of the pinyin of the characters and the vector of the domain entry;
the output probability distribution process of the calculated characters introduces vector representation of the pinyin of the characters and vectors of terms in the field, namely, the characteristics of the text to be corrected are added, so that the accuracy of calculating the output probability can be effectively improved.
Step S303: and determining the characters included in the correct text according to the output probability distribution of the characters.
In an embodiment of the present invention, as shown in fig. 4, the method for text error correction may further include the following steps:
step S401: encoding the vector representation of the text to be corrected, the vector representation of the pinyin of the character and the vector representation of the term of the field by using an encoder respectively;
The specific embodiment of step S401 may include: merging the vector representation of the domain entry into the vector representation of the text to be corrected and the vector representation of the character pinyin; and encoding the merged result.
The encoding process can use the following calculation formulas (1) and (2) to obtain a character vector representation fused with the domain entries:

α_ij = softmax( (W_q h_i^x)^T (W_k h_j^k) / √d )    (1)

h_i = h_i^x + Σ_j α_ij · (W_v h_j^k)    (2)

wherein α_ij denotes the attention weight between the i-th character of the text to be corrected and the j-th character of its retrieved domain entry; softmax() denotes the softmax function; W_q, W_k, and W_v denote parameter matrices trained with the encoder; d denotes the dimension of W_k h_j^k; h_i^x denotes the vector representation of the i-th character in the text to be corrected; h_j^k denotes the vector representation of the j-th character in the domain entry retrieved for the i-th character; and h_i denotes the knowledge-fused character representation.
That is, the result of calculation formula (2) is the output of the encoder (the encoding corresponding to the knowledge-fused characters).
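Formulas (1) and (2) can be sketched in plain Python as scaled dot-product cross-attention with a residual connection. For brevity the projection matrices W_q, W_k, and W_v are taken as the identity here, which is an assumption for illustration, not the trained parameters:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def fuse(h_x, h_ks, d):
    # h_x: vector of one source character; h_ks: vectors of its domain-entry
    # characters. Formula (1): scaled dot-product attention weights.
    scores = [sum(a * b for a, b in zip(h_x, hk)) / math.sqrt(d) for hk in h_ks]
    alphas = softmax(scores)
    # Formula (2): residual add of the attention-weighted domain vectors.
    fused = list(h_x)
    for alpha, hk in zip(alphas, h_ks):
        fused = [f + alpha * v for f, v in zip(fused, hk)]
    return fused
```

With a single domain vector the attention weight is 1, so the fused representation is simply the character vector plus the domain vector; with several, the domain knowledge is mixed in proportion to the attention weights.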
Step S402: inputting the encoded result into a decoder included in the text error correction model;
step S403: and a step in which the decoder calculates the output probability distribution of the character based on the result of the encoding.
Because the field entry is introduced in the coding, the decoder can calculate the output probability distribution of the characters more accurately in the decoding process.
In an embodiment of the present invention, as shown in fig. 5, the method for text error correction may further include the following steps:
Step S501: determining an confusion set of each character in the text to be corrected, wherein the confusion set comprises a plurality of approximate characters;
The specific implementation manner of this step S501 may be: the confusion sets for each character are stored in advance in the database as shown in table 1 below. And searching the confusion set of each character in the text to be corrected by a searching mode.
TABLE 1

| Character | Confusion set |
| "sheng" (lift) | near-homophones and near-shapes: nephew, sound, …, province |
| "wen" (text) | near-homophones and near-shapes: mosquito, pattern, line, … |
| … | … |
| "xin" (heart) | near-homophones and near-shapes: new, zinc, core, … |
The specific implementation manner of this step S501 may also be: searching approximate characters, near-voice characters and near-shape characters for each character in the text to be corrected, and combining the searched approximate characters, near-voice characters and near-shape characters into a confusion set of the corresponding characters.
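Building a confusion set per character (step S501) can be sketched as the union of homophone and similar-shape lookup tables. The tables below are tiny hypothetical samples, not real linguistic resources:

```python
# Hypothetical near-sound and near-shape tables for illustration only.
NEAR_SOUND = {"心": ["新", "欣", "芯"]}
NEAR_SHAPE = {"心": ["必"]}

def confusion_set(ch):
    # Union of near-sound and near-shape characters, plus the character
    # itself (so the model can always keep the original character).
    return {ch, *NEAR_SOUND.get(ch, []), *NEAR_SHAPE.get(ch, [])}
```

A character absent from both tables gets a singleton confusion set containing only itself, which leaves it effectively uncorrectable by the generation mode.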
Step S502: based on the confusion set, a step of calculating an output probability of a character in the text to be corrected is performed.
As shown in fig. 6, the process of calculating the output probability of the character in the text to be corrected in the step S502 may include the following steps:
Step S601: calculating the copy probability of each character included in the text to be corrected based on the text to be corrected and the field entry;
the calculation of this step is performed by the following calculation formula (3):

P_t^copy(i) = softmax( (W_q s_t)^T (W_k h_i) / √d )    (3)

wherein P_t^copy(i) denotes the copy probability of the i-th character of the text to be corrected at decoding time t (the copy probability is the probability of copying the character from the source, i.e. the text to be corrected and the domain entries); s_t denotes the hidden decoding state at time t; W_q and W_k denote trained parameter matrices; and h_i denotes the knowledge-fused encoding of the i-th character of the original input text (the text to be corrected).
Step S602: calculating the generation probability of the characters included in the vocabulary based on the confusion set corresponding to each character;
the vocabulary refers to a vocabulary formed by characters in the confusion set corresponding to each character.
The specific implementation manner of step S602 may include: constructing a confusion set indication matrix for the text to be corrected according to the confusion set corresponding to each character; and calculating the generation probability of each character in the vocabulary, with the indication matrix limiting the output range of the generation mode to the confusion set.
The confusion set indication matrix M ∈ R^{g×|V|}, where |V| denotes the number of characters in all confusion sets corresponding to the text to be corrected and g denotes the length of the text to be corrected. Each element M_if of M takes the value 0 or 1, calculated by the following formula (4):

M_if = 1 if the f-th vocabulary character is in the confusion set of the i-th character, and M_if = 0 otherwise.    (4)

Using the confusion set indication matrix M, the generation probability of each character is calculated by formula (5), which limits the output range of the generation mode to the confusion set:

P_t^gen = P̂_t^gen ⊙ M_i    (5)

wherein P_t^gen denotes the generation probability over the confusion set at time t of the error correction decoding; P̂_t^gen denotes the reference generation probability at time t preset in the decoder; and M_i denotes the row of the confusion set indication matrix associated with the i-th character (i.e. a new indication vector formed from the i-th row of the matrix).
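Formulas (4) and (5) can be sketched as building a 0/1 mask over the vocabulary and applying it elementwise to the decoder's reference distribution. The final renormalisation is an assumption added so that the masked values again form a probability distribution; the vocabulary is toy data:

```python
def masked_gen_probs(p_ref, conf_set, vocab):
    # Formula (4): M_if = 1 iff vocab[f] is in the confusion set of
    # the current character, else 0.
    mask = [1.0 if w in conf_set else 0.0 for w in vocab]
    # Formula (5): elementwise product with the mask row, then
    # renormalise (the renormalisation is an assumption).
    masked = [p * m for p, m in zip(p_ref, mask)]
    z = sum(masked) or 1.0
    return [p / z for p in masked]
```

Restricting the distribution this way means the generation mode can never emit a character outside the confusion set, which is exactly the search-space reduction the text describes.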
Step S603: and calculating the output probability distribution of each character according to the generation probability of each character included in the vocabulary and the replication probability of each character included in the text to be corrected.
Here "each character" covers both the characters included in the vocabulary and the characters included in the text to be corrected.
In this step, the output probability can be calculated by the following calculation formula (6):

P_t(i) = β · P_t^copy(i) + (1 − β) · P_t^gen(i)    (6)

wherein P_t(i) denotes the output probability of the i-th character at time t of the error correction decoding of the text to be corrected; β denotes a trained weight for the copy mode, acting as a balance factor between generating from the confusion vocabulary and copying from the text to be corrected during decoding; P_t^gen(i) is the result of calculation formula (5); and P_t^copy(i) is the copy probability of the i-th character at time t, calculated by formula (3). The copy probability is the probability of copying from the source (the text to be corrected and the domain entries).
In an embodiment of the present invention, the text error correction method may further include: constructing a loss function by using the output probability of each training sample; model parameters are trained by minimizing the value of the loss function to obtain a text error correction model.
The loss function constructed in the above step is calculated by the following formula (7):

loss = − Σ_{t=1}^{T} log P_t(i′)    (7)

wherein loss denotes the loss value; P_t(i′) denotes the output probability of the i′-th (correct) character of the training text included in the training sample; and T denotes the total number of characters of the training text included in the training sample.
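Formula (7) is a standard negative log-likelihood over the gold characters. The sketch below averages over the T positions; whether the patent sums or averages is not recoverable from the text, so the 1/T normalisation is an assumption:

```python
import math

def nll_loss(step_probs, gold_indices):
    # step_probs[t][i] = output probability P_t(i) at decoding step t;
    # gold_indices[t] = index i' of the correct character at step t.
    T = len(gold_indices)
    return -sum(math.log(p[i]) for p, i in zip(step_probs, gold_indices)) / T
```

A perfectly confident model (probability 1 on every gold character) yields loss 0; lower probabilities on the gold characters increase the loss, and minimising it trains the model parameters as described above.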
In summary, the scheme provided by the embodiment of the invention integrates domain knowledge into the encoding process, improving the error detection and correction capability of the text error correction model. In addition, the pinyin features of the characters to be corrected are added during encoding; meanwhile, during decoding, the constraint of the confusion set is introduced into the generation range of the text error correction model, which reduces the search space and improves both prediction accuracy and calculation efficiency.
As shown in fig. 7, an embodiment of the present invention provides a text error correction apparatus 700, where the text error correction apparatus 700 may include: a text processing module 701, a domain matching module 702, and a text error correction module 703, wherein,
The text processing module 701 is configured to obtain a text to be corrected, and generate a character pinyin for the text to be corrected;
The domain matching module 702 is configured to divide the text to be corrected into a plurality of character segments with preset lengths, and match domain entries for the character segments in a preset domain knowledge base according to the pinyin of the characters corresponding to the character segments;
The text correction module 703 is configured to input the text to be corrected, the character pinyin and the domain entry into a text correction model, correct the text to be corrected by using the text correction model, and output a corrected correct text; the text error correction model is obtained by training a training sample, the training sample comprises an error text and a correct text corresponding to the error text, and training information input for the text error correction model comprises: error text, pinyin for characters of the error text, and domain entries for the error text.
In the embodiment of the present invention, the domain matching module 702 is configured to divide the text to be corrected into a plurality of character segments of preset length, and to search a preset domain knowledge base for domain entries for the character segments according to the character pinyin corresponding to each segment.
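The splitting-and-matching performed by the domain matching module can be sketched as follows. The pinyin table, the knowledge base, and the washing-machine example (洗衣机, with 依 miswritten for 衣) are all toy assumptions; a real system would use a full pinyin converter (e.g. the pypinyin package) and a large domain lexicon:

```python
# Minimal sketch: split the text into fixed-length character segments
# and look up each segment's pinyin sequence in a pinyin-keyed domain
# knowledge base, so homophone misspellings still hit the right entry.
PINYIN = {"洗": "xi", "机": "ji", "衣": "yi", "依": "yi"}  # hypothetical table
DOMAIN_KB = {("xi", "yi", "ji"): "洗衣机"}                 # hypothetical KB

def char_pinyin(text):
    return [PINYIN.get(ch, ch) for ch in text]

def match_domain_entries(text, n=3):
    """Split text into length-n segments and match each segment's pinyin
    sequence against the domain knowledge base."""
    py = char_pinyin(text)
    hits = []
    for start in range(len(text) - n + 1):
        key = tuple(py[start:start + n])
        if key in DOMAIN_KB:
            hits.append((start, DOMAIN_KB[key]))
    return hits

# 洗依机 (a misspelling of 洗衣机) shares the pinyin xi-yi-ji, so the
# pinyin lookup still retrieves the correct domain entry.
print(match_domain_entries("洗依机"))  # → [(0, '洗衣机')]
```

Keying the knowledge base by pinyin rather than by surface characters is what lets the retrieval survive the very errors the model is meant to correct.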
In the embodiment of the present invention, the text error correction module 703 is configured to convert the text to be corrected, the character pinyin and the domain entry into corresponding vector representations respectively; input the vector representation of the text to be corrected, the vector representation of the character pinyin and the vector representation of the domain entry into the text correction model; have the text error correction model calculate the output probability distribution of characters based on those three vector representations; and determine the characters included in the correct text according to the output probability distribution of the characters.
In the embodiment of the present invention, the text error correction module 703 is further configured to encode, with an encoder, a vector representation of a text to be error corrected, a vector representation of a pinyin of a character, and a vector representation of an entry of a domain, respectively; inputting the encoded result into a decoder included in the text error correction model; the decoder calculates the output probability of the character based on the result of the encoding.
In the embodiment of the present invention, the text correction module 703 is configured to merge the vector representation of the domain entry into the vector representation of the text to be corrected and the vector representation of the character pinyin, and to encode the merged result.
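The patent states only that the domain-entry vectors are merged with the text and pinyin vectors before encoding; one plausible realization is an attention-weighted sum, sketched below (the dimensions, the attention form, and all values are assumptions, not the patented mechanism):

```python
import numpy as np

# Hedged sketch of the fusion step: each character embedding is summed
# with its pinyin embedding and an attention-weighted mix of the
# retrieved domain-entry embeddings, producing the encoder input.
rng = np.random.default_rng(0)
d = 8
char_emb   = rng.normal(size=(5, d))   # 5 characters of the text
pinyin_emb = rng.normal(size=(5, d))   # their pinyin embeddings
entry_emb  = rng.normal(size=(2, d))   # 2 retrieved domain entries

scores = char_emb @ entry_emb.T                       # (5, 2) relevance
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
fused = char_emb + pinyin_emb + weights @ entry_emb   # (5, d) encoder input
print(fused.shape)  # (5, 8)
```

The additive form keeps the sequence length unchanged, so the fused representation can be fed to the encoder exactly where the plain character embeddings would have gone.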
In an embodiment of the present invention, the text error correction module 703 is configured to determine a confusion set for each character in the text to be corrected, where the confusion set includes a plurality of approximate characters; based on the confusion set, the decoder performs the step of calculating the output probability distribution of the characters.
In the embodiment of the present invention, the text correction module 703 is configured to calculate the copy probability of each character included in the text to be corrected, based on the text to be corrected and the domain entries retrieved for it; calculate the generation probability of each character in the vocabulary based on the confusion set corresponding to each character; and calculate the output probability distribution of each character according to the generation probability of each character in the vocabulary and the copy probability of each character in the text to be corrected.
In the embodiment of the present invention, the text error correction module 703 is further configured to construct a confusion set indication matrix for the text to be corrected according to the confusion set corresponding to each character; the step of calculating the generation probability of each character in the vocabulary is performed with the output range of the generation mode limited to the confusion set by the confusion set indication matrix.
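The effect of the indication matrix can be illustrated with a softmax mask; driving out-of-set logits to −inf is a common way to realize such a constraint and is an assumption here, since the patent does not spell out the exact mechanism:

```python
import numpy as np

# Sketch of constraining the generation mode with a confusion-set
# indication matrix: row i of M is a 0/1 mask over the vocabulary that
# marks the confusion set of the ith source character. Logits outside
# the confusion set are set to -inf, so after the softmax they receive
# probability 0. The toy vocabulary and confusion set are assumptions.
vocab = ["机", "几", "记", "衣", "依"]   # hypothetical 5-character vocabulary
M = np.array([[0, 0, 0, 1, 1]])        # confusion set of char 0: {衣, 依}

logits = np.array([[2.0, 1.0, 0.5, 3.0, 1.5]])   # decoder scores at time t
masked = np.where(M == 1, logits, -np.inf)
p_gen = np.exp(masked - masked.max(-1, keepdims=True))
p_gen = p_gen / p_gen.sum(-1, keepdims=True)
print(p_gen.round(3))  # probability mass only on 衣 and 依
```

Restricting the softmax to each character's confusion set is exactly what shrinks the search space: the model chooses among a handful of approximate characters instead of the full vocabulary.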
In the embodiment of the present invention, the text error correction module 703 is further configured to construct a loss function by using the output probability of each training sample; model parameters are trained by minimizing the value of the loss function to obtain a text error correction model.
The text error correction apparatus can be installed on a client as a plug-in, or installed on a server that communicates with the client.
Fig. 8 illustrates an exemplary system architecture 800 of a method of text correction or an apparatus of text correction to which embodiments of the present invention may be applied.
As shown in fig. 8, a system architecture 800 may include terminal devices 801, 802, 803, a network 804, and a server 805. The network 804 serves as a medium for providing communication links between the terminal devices 801, 802, 803 and the server 805. The network 804 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
A user may interact with the server 805 through the network 804 using the terminal devices 801, 802, 803 to receive or send messages or the like. Various communication client applications such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only) may be installed on the terminal devices 801, 802, 803.
The terminal devices 801, 802, 803 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 805 may be a server providing various services. For example, the server 805 may encapsulate the trained text correction model into a text correction apparatus or package it as a plug-in, and publish the apparatus or plug-in via the network 804 to the various communication client applications installed on the terminal devices 801, 802, 803. The server 805 may also encapsulate the trained text correction model into a text correction apparatus and run that apparatus itself.
In the case where the server 805 packages the trained text correction model into a text correction apparatus or plug-in and publishes it via the network 804 to the communication client applications installed on the terminal devices 801, 802, 803: when a communication client application on a terminal device receives externally input text, it takes that text as the text to be corrected and performs correction processing on it through the text correction apparatus or plug-in.
In the case where the server 805 encapsulates the trained text correction model into a text correction apparatus and runs it itself: the server acquires the text input by a user through a communication client application on the terminal devices 801, 802, 803, takes that text as the text to be corrected, performs correction processing on it through the text correction apparatus, and outputs the corrected correct text back to the communication client application.
It should be noted that, the method for text error correction provided in the embodiment of the present invention may be executed by the terminal devices 801, 802, 803 or the server 805, and accordingly, the apparatus for text error correction may be set in the terminal devices 801, 802, 803 or the server 805.
It should be understood that the number of terminal devices, networks and servers in fig. 8 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 9, there is illustrated a schematic diagram of a computer system 900 suitable for use in implementing an embodiment of the present invention. The terminal device shown in fig. 9 is only an example, and should not impose any limitation on the functions and the scope of use of the embodiment of the present invention.
As shown in fig. 9, the computer system 900 includes a Central Processing Unit (CPU) 901, which can execute various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 902 or a program loaded from a storage section 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data necessary for the operation of the system 900 are also stored. The CPU 901, ROM 902, and RAM 903 are connected to each other through a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
The following components are connected to the I/O interface 905: an input section 906 including a keyboard, a mouse, and the like; an output portion 907 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage portion 908 including a hard disk or the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the internet. The drive 910 is also connected to the I/O interface 905 as needed. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed as needed on the drive 910 so that a computer program read out therefrom is installed into the storage section 908 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from the network via the communication portion 909 and/or installed from the removable medium 911. The above-described functions defined in the system of the present invention are performed when the computer program is executed by a Central Processing Unit (CPU) 901.
The computer readable medium shown in the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules involved in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor, for example described as: a processor includes a text processing module, a domain matching module, and a text error correction module. The names of these modules do not in any way limit the modules themselves; for example, the text processing module may also be described as "a module that obtains the text to be corrected and generates character pinyin for it".
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist alone without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to: acquire a text to be corrected and generate character pinyin for the text to be corrected; search a preset domain knowledge base for domain entries for the text to be corrected; input the text to be corrected, the character pinyin and the domain entries into a text correction model, where the text correction model is trained with training samples, each training sample comprises an error text and the correct text corresponding to the error text, and the training information input to the text correction model comprises: the error text, the character pinyin of the error text, and the domain entries of the error text; and correct the text to be corrected with the text correction model and output the corrected correct text.
According to the technical scheme provided by the embodiment of the present application, character pinyin is generated for the text to be corrected, the text is divided into a plurality of character segments of preset length, and domain entries are matched for those segments in a preset domain knowledge base according to the segments' character pinyin. Introducing the character pinyin and the domain entries enriches the features of the text to be corrected and narrows the range from which the text correction model copies or generates the correct text, which effectively improves both the accuracy and the efficiency of text correction.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.
Claims (11)
1. A method of text correction, comprising:
Acquiring a text to be corrected, and generating character pinyin for the text to be corrected;
searching a domain entry for the text to be corrected in a preset domain knowledge base;
Inputting the text to be corrected, the character pinyin and the field entry into a text correction model, wherein the text correction model is trained by a training sample, the training sample comprises an error text and a correct text corresponding to the error text, and training information input for the text correction model comprises: the error text, the pinyin of the characters of the error text and the domain entry of the error text;
correcting the text to be corrected by using the text correction model, and outputting corrected correct text;
and matching the field entry for the text to be corrected, including:
dividing the text to be corrected into a plurality of character fragments with preset lengths;
and searching a preset domain knowledge base for the domain entry for the character segment according to the character pinyin corresponding to the character segment, wherein the domain knowledge base refers to a set containing a series of specialized vocabulary of the domain.
2. The method as recited in claim 1, further comprising:
respectively converting the text to be corrected, the character pinyin and the field entry into corresponding vector representations;
correcting the text to be corrected, including:
Inputting the vector representation of the text to be corrected, the vector representation of the character pinyin and the vector representation of the field entry into the text correction model;
the text error correction model calculates the output probability distribution of characters based on the vector representation of the text to be corrected, the vector representation of the pinyin of the characters and the vector of the domain entry;
and determining the characters included in the correct text according to the output probability distribution of the characters.
3. The method of claim 2, wherein calculating the output probability distribution of the character comprises:
Encoding the vector representation of the text to be corrected, the vector representation of the pinyin of the character and the vector representation of the field entry by using an encoder;
inputting the encoded result into a decoder included in the text error correction model;
the decoder calculates an output probability distribution of the character based on the result of the encoding.
4. A method according to claim 3, wherein encoding the vector representation of the text to be corrected, the vector representation of the pinyin for the character, and the vector representation of the domain entry, respectively, with the encoder comprises:
merging the vector representation of the domain entry into the vector representation of the text to be corrected and the vector representation of the character pinyin;
And encoding the merged result.
5. A method as claimed in claim 3, further comprising:
Determining a confusion set of each character in the text to be corrected, wherein the confusion set comprises a plurality of approximate characters;
based on the confusion set, the decoder performs the step of calculating an output probability distribution for the character.
6. The method of claim 5, wherein calculating the output probability distribution of the character comprises:
calculating the copy probability of each character included in the text to be corrected based on the text to be corrected and the domain entries retrieved for the text to be corrected;
calculating the generation probability of the characters included in the vocabulary based on the confusion set corresponding to each character;
and calculating the output probability of each character according to the generation probability of the characters included in the vocabulary and the copy probability of each character included in the text to be corrected.
7. The method as recited in claim 6, further comprising:
Constructing a confusion set indication matrix for the text to be corrected according to the confusion set corresponding to each character;
limiting the output range in the generation mode to be within the confusion set through the confusion set indication matrix, and executing the step of calculating the generation probability of each character included in the word list.
8. The method as recited in claim 5, further comprising:
constructing a loss function by using the output probability of each training sample;
model parameters are trained by minimizing the value of the loss function to obtain the text error correction model.
9. An apparatus for text correction, comprising: a text processing module, a field matching module and a text error correction module, wherein,
The text processing module is used for acquiring the text to be corrected and generating character pinyin for the text to be corrected;
The domain matching module is used for dividing the text to be corrected into a plurality of character segments with preset lengths, and matching domain entries for the character segments in a preset domain knowledge base according to the pinyin of the characters corresponding to the character segments;
The text correction module is used for inputting the text to be corrected, the character pinyin and the field entry into a text correction model, correcting the text to be corrected by using the text correction model, and outputting corrected correct text; the text error correction model is obtained by training a training sample, the training sample comprises an error text and a correct text corresponding to the error text, and training information input for the text error correction model comprises: the error text, the pinyin of the characters of the error text and the domain entry of the error text;
the domain matching module is used for dividing the text to be corrected into a plurality of character segments with preset lengths; and searching a preset domain knowledge base for the domain entry for the character segment according to the character pinyin corresponding to the character segment, wherein the domain knowledge base refers to a set containing a series of specialized vocabulary of the domain.
10. An electronic device, comprising:
One or more processors;
storage means for storing one or more programs,
When executed by the one or more processors, causes the one or more processors to implement the method of any of claims 1-8.
11. A computer readable medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110279919.9A CN113051894B (en) | 2021-03-16 | 2021-03-16 | Text error correction method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113051894A CN113051894A (en) | 2021-06-29 |
CN113051894B true CN113051894B (en) | 2024-07-16 |
Family
ID=76512806
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110279919.9A Active CN113051894B (en) | 2021-03-16 | 2021-03-16 | Text error correction method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113051894B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114239559B (en) * | 2021-11-15 | 2023-07-11 | 北京百度网讯科技有限公司 | Text error correction and text error correction model generation method, device, equipment and medium |
CN116757184B (en) * | 2023-08-18 | 2023-10-20 | 昆明理工大学 | Vietnam voice recognition text error correction method and system integrating pronunciation characteristics |
CN117787266B (en) * | 2023-12-26 | 2024-07-26 | 人民网股份有限公司 | Large language model text error correction method and device based on pre-training knowledge embedding |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111428494A (en) * | 2020-03-11 | 2020-07-17 | 中国平安人寿保险股份有限公司 | Intelligent error correction method, device and equipment for proper nouns and storage medium |
CN111523306A (en) * | 2019-01-17 | 2020-08-11 | 阿里巴巴集团控股有限公司 | Text error correction method, device and system |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103678271B (en) * | 2012-09-10 | 2016-09-14 | 华为技术有限公司 | A kind of text correction method and subscriber equipment |
US20140214401A1 (en) * | 2013-01-29 | 2014-07-31 | Tencent Technology (Shenzhen) Company Limited | Method and device for error correction model training and text error correction |
US10354009B2 (en) * | 2016-08-24 | 2019-07-16 | Microsoft Technology Licensing, Llc | Characteristic-pattern analysis of text |
CN107741928B (en) * | 2017-10-13 | 2021-01-26 | 四川长虹电器股份有限公司 | Method for correcting error of text after voice recognition based on domain recognition |
CN109753636A (en) * | 2017-11-01 | 2019-05-14 | 阿里巴巴集团控股有限公司 | Machine processing and text error correction method and device calculate equipment and storage medium |
CN109492202B (en) * | 2018-11-12 | 2022-12-27 | 浙江大学山东工业技术研究院 | Chinese error correction method based on pinyin coding and decoding model |
CN111626048A (en) * | 2020-05-22 | 2020-09-04 | 腾讯科技(深圳)有限公司 | Text error correction method, device, equipment and storage medium |
CN111695342B (en) * | 2020-06-12 | 2023-04-25 | 复旦大学 | Text content correction method based on context information |
CN112287670A (en) * | 2020-11-18 | 2021-01-29 | 北京明略软件系统有限公司 | Text error correction method, system, computer device and readable storage medium |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111523306A (en) * | 2019-01-17 | 2020-08-11 | 阿里巴巴集团控股有限公司 | Text error correction method, device and system |
CN111428494A (en) * | 2020-03-11 | 2020-07-17 | 中国平安人寿保险股份有限公司 | Intelligent error correction method, device and equipment for proper nouns and storage medium |
Non-Patent Citations (1)
Title |
---|
Chinese grammatical error correction method based on data augmentation and copying; Wang Quanbin, Tan Ying; CAAI Transactions on Intelligent Systems; Vol. 15, No. 1; pp. 99-105 *
Also Published As
Publication number | Publication date |
---|---|
CN113051894A (en) | 2021-06-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR102401942B1 (en) | Method and apparatus for evaluating translation quality | |
CN113051894B (en) | Text error correction method and device | |
CN109376234B (en) | Method and device for training abstract generation model | |
JP7301922B2 (en) | Semantic retrieval method, device, electronic device, storage medium and computer program | |
US11899699B2 (en) | Keyword generating method, apparatus, device and storage medium | |
JP7335300B2 (en) | Knowledge pre-trained model training method, apparatus and electronic equipment | |
CN111611452B (en) | Method, system, equipment and storage medium for identifying ambiguity of search text | |
CN114861889B (en) | Deep learning model training method, target object detection method and device | |
JP2021033995A (en) | Text processing apparatus, method, device, and computer-readable storage medium | |
CN113299282B (en) | Voice recognition method, device, equipment and storage medium | |
CN111488742B (en) | Method and device for translation | |
WO2024045475A1 (en) | Speech recognition method and apparatus, and device and medium | |
WO2024146328A1 (en) | Training method for translation model, translation method, and device | |
US20180137098A1 (en) | Methods and systems for providing universal portability in machine learning | |
CN114743012B (en) | Text recognition method and device | |
KR102608867B1 (en) | Method for industry text increment, apparatus thereof, and computer program stored in medium | |
CN112542154B (en) | Text conversion method, text conversion device, computer readable storage medium and electronic equipment | |
CN110852057A (en) | Method and device for calculating text similarity | |
US20230153550A1 (en) | Machine Translation Method and Apparatus, Device and Storage Medium | |
CN115860003A (en) | Semantic role analysis method and device, electronic equipment and storage medium | |
CN112883711B (en) | Method and device for generating abstract and electronic equipment | |
CN115357710A (en) | Training method and device for table description text generation model and electronic equipment | |
CN112560466A (en) | Link entity association method and device, electronic equipment and storage medium | |
JP2017059216A (en) | Query calibration system and method | |
CN110162767B (en) | Text error correction method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information |
Address after: Room 221, 2/F, Block C, 18 Kechuang 11th Street, Beijing Economic and Technological Development Zone, 100176. Applicant after: Jingdong Technology Holding Co.,Ltd.
Address before: Room 221, 2/F, Block C, 18 Kechuang 11th Street, Beijing Economic and Technological Development Zone, 100176. Applicant before: Jingdong Digital Technology Holding Co.,Ltd.
GR01 | Patent grant | ||