CN113051894B - Text error correction method and device

Publication number: CN113051894B (application CN202110279919.9A)
Authority: CN (China)
Prior art keywords: text, corrected, character, pinyin, domain
Legal status: Active
Other language: Chinese (zh)
Other version: CN113051894A
Inventors: 王培英 (Wang Peiying), 陈蒙 (Chen Meng)
Assignee: Jingdong Technology Holding Co Ltd
Application filed by Jingdong Technology Holding Co Ltd
Priority: CN202110279919.9A


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 — Handling natural language data
    • G06F 40/20 — Natural language analysis
    • G06F 40/232 — Orthographic correction, e.g. spell checking or vowelisation
    • G06F 40/279 — Recognition of textual entities
    • G06F 40/289 — Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a text error correction method and device, relating to the field of computer technology. A specific implementation of the method comprises the following steps: acquiring a text to be corrected and generating character pinyin for the text to be corrected; searching a preset domain knowledge base for domain entries for the text to be corrected; inputting the text to be corrected, the character pinyin and the domain entries into a text error correction model, wherein the text error correction model is trained with training samples, each training sample comprises an error text and the corresponding correct text, and the training information input to the text error correction model comprises: the error text, the pinyin of the characters of the error text, and the domain entries of the error text; and correcting the text to be corrected with the text error correction model. This embodiment can improve the accuracy and efficiency of text error correction.

Description

Text error correction method and device
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and apparatus for text error correction.
Background
Many application scenarios, such as search, text conversion, intention recognition and intelligent customer service, involve text error correction (i.e. the process of correcting errors in a text). With the errors corrected, downstream processing can accurately perform lexical analysis, intention recognition and so on; from the perspective of the overall natural language processing pipeline, text error correction therefore acts as a safeguard for the steps that follow.
Text error correction currently relies largely on manually constructed dictionaries of wrongly written characters for error matching and correction.
In the process of implementing the present invention, the inventors found that the prior art has at least the following problem:
due to the limitations of such a dictionary, some relatively rare proper nouns and the like may not be included in it, resulting in lower accuracy and efficiency of text error correction.
Disclosure of Invention
In view of this, the embodiments of the present invention provide a method and apparatus for text error correction, which can effectively improve the accuracy and efficiency of text error correction.
To achieve the above object, according to one aspect of the embodiments of the present invention, there is provided a text error correction method, including:
Acquiring a text to be corrected, and generating character pinyin for the text to be corrected;
searching a domain entry for the text to be corrected in a preset domain knowledge base;
Inputting the text to be corrected, the character pinyin and the domain entry into a text error correction model, wherein the text error correction model is trained with training samples, each training sample comprises an error text and the correct text corresponding to the error text, and the training information input to the text error correction model comprises: the error text, the pinyin of the characters of the error text, and the domain entry of the error text;
and correcting the text to be corrected by using the text correction model, and outputting corrected correct text.
Preferably, searching the domain entry for the text to be corrected includes:
dividing the text to be corrected into a plurality of character fragments with preset lengths;
And searching the domain entry for the character segment in a preset domain knowledge base according to the character pinyin corresponding to the character segment.
Preferably, the text error correction method further includes:
converting the text to be corrected, the character pinyin and the domain entry into corresponding vector representations respectively;
correspondingly, correcting the text to be corrected includes:
inputting the vector representation of the text to be corrected, the vector representation of the character pinyin and the vector representation of the domain entry into the text error correction model;
calculating, by the text error correction model, the output probability distribution of the characters based on the vector representation of the text to be corrected, the vector representation of the character pinyin and the vector representation of the domain entry;
and determining the characters included in the correct text according to the output probability distribution of the characters.
Preferably, calculating the output probability distribution of the character includes:
Encoding the vector representation of the text to be corrected, the vector representation of the pinyin of the character and the vector representation of the domain entry by using the encoder respectively;
inputting the encoded result into a decoder included in the text error correction model;
and the decoder calculates the output probability distribution of the character according to the coding result.
Preferably, encoding the vector representation of the text to be corrected, the vector representation of the character pinyin and the vector representation of the domain entry respectively includes:
merging the vector representation of the domain entry into the vector representation of the text to be corrected and the vector representation of the character pinyin;
And encoding the merged result.
Preferably, the text error correction method further includes:
Determining a confusion set for each character in the text to be corrected, wherein the confusion set comprises a plurality of similar characters;
based on the confusion set, the decoder performs the step of calculating an output probability distribution for the character.
Preferably, calculating the output probability distribution of the character includes:
calculating the copy probability of each character included in the text to be corrected based on the text to be corrected and the domain entry;
calculating the generation probability of the characters included in the vocabulary based on the confusion set corresponding to each character;
and calculating the output probability distribution of each character according to the generation probability of the characters included in the vocabulary and the copy probability of each character included in the text to be corrected.
Preferably, the text error correction method further includes:
Constructing a confusion set indication matrix for the text to be corrected according to the confusion set corresponding to each character;
and performing the step of calculating the generation probability of each character included in the vocabulary, with the output range of the generation mode limited to the confusion set by the confusion set indication matrix.
Preferably, the text error correction method further includes:
constructing a loss function by using the output probability of each training sample;
model parameters are trained by minimizing the value of the loss function to obtain the text error correction model.
In a second aspect, an embodiment of the present invention provides a text error correction apparatus, including: a text processing module, a field matching module and a text error correction module, wherein,
The text processing module is used for acquiring the text to be corrected and generating character pinyin for the text to be corrected;
The domain matching module is used for dividing the text to be corrected into a plurality of character segments with preset lengths, and searching domain entries for the character segments in a preset domain knowledge base according to the pinyin of the characters corresponding to the character segments;
The text error correction module is used for inputting the text to be corrected, the character pinyin and the domain entry into a text error correction model, correcting the text to be corrected by using the text error correction model, and outputting the corrected correct text; the text error correction model is trained with training samples, each training sample comprises an error text and the correct text corresponding to the error text, and the training information input to the text error correction model comprises: the error text, the pinyin of the characters of the error text, and the domain entry of the error text.
An embodiment of the above application has the following advantages or benefits: in the scheme provided by the application, character pinyin is generated for the text to be corrected, the text to be corrected is divided into a plurality of character segments of preset length, and domain entries are searched for the character segments in a preset domain knowledge base according to the pinyin corresponding to each segment; that is, character pinyin and domain entries are introduced for the text to be corrected. Introducing them, on the one hand, adds features to the text to be corrected and, on the other hand, narrows the range from which the text error correction model copies or generates the correct text, so both the accuracy and the efficiency of text error correction can be effectively improved.
Further effects of the above optional implementations are described below in connection with the specific embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of the main flow of a method of text correction according to an embodiment of the invention;
FIG. 2 is a schematic diagram of the main structure of text error correction according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the main flow of error correction for text to be error corrected according to an embodiment of the invention;
FIG. 4 is a schematic diagram of the main flow of calculating the output probability of characters with fused domain entries according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the main flow of calculating the output probability of characters based on confusion sets, according to an embodiment of the invention;
FIG. 6 is a schematic diagram of the main flow of calculating the output probability of characters based on confusion sets, according to another embodiment of the invention;
FIG. 7 is a schematic diagram of the main modules of an apparatus for text error correction according to an embodiment of the present invention;
FIG. 8 is an exemplary system architecture diagram in which embodiments of the present invention may be applied;
fig. 9 is a schematic diagram of a computer system suitable for use in implementing an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present invention are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 illustrates a text error correction method according to an embodiment of the present invention. As shown in Fig. 1, the text error correction method may include the following steps:
Step S101: acquiring a text to be corrected, and generating character pinyin for the text to be corrected;
Since characters in a text generally have many homophones, the pinyin of the characters helps broaden the correction candidates for the text to be corrected. For example, for the text to be corrected "is this washing machine a west-rainbow (xi hong) all-in-one" — a homophone error for "wash-and-dry (xi hong) all-in-one" — this step may generate the pinyin "zhe kuan xi yi ji shi xi hong yi ti de ma". Because one pinyin syllable can correspond to many homophones, introducing character pinyin at this step provides additional error correction features for the later stages.
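As an illustrative sketch of step S101, character pinyin can be generated by a per-character table lookup. The table below is a toy assumption covering only this example (a production system would use a full pinyin dictionary, e.g. the pypinyin package), and the Chinese sentence is reconstructed from the pinyin quoted above.

```python
# Sketch of step S101: generate character pinyin by per-character lookup.
# PINYIN_TABLE is a toy assumption covering only this example; a real
# system would use a full pinyin dictionary (e.g. the pypinyin package).
PINYIN_TABLE = {
    "这": "zhe", "款": "kuan", "洗": "xi", "衣": "yi", "机": "ji",
    "是": "shi", "西": "xi", "虹": "hong", "一": "yi", "体": "ti",
    "的": "de", "吗": "ma",
}

def to_pinyin(text: str) -> list[str]:
    """One pinyin syllable per character; unknown characters map to ''."""
    return [PINYIN_TABLE.get(ch, "") for ch in text]

print(" ".join(to_pinyin("这款洗衣机是西虹一体的吗")))
# zhe kuan xi yi ji shi xi hong yi ti de ma
```

Note that the homophone error 西虹/洗烘 is invisible at the pinyin level, which is exactly why the pinyin sequence is a useful extra feature for the model.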
Step S102: searching a domain entry for a text to be corrected in a preset domain knowledge base;
Step S103: inputting the text to be corrected, the character pinyin and the domain entry into a text error correction model, wherein the text error correction model is trained with training samples, each training sample comprises an error text and the correct text corresponding to the error text, and the training information input to the text error correction model comprises: the error text, the pinyin of the characters of the error text, and the domain entry of the error text;
Step S104: and correcting the text to be corrected by using the text correction model, and outputting corrected correct text.
The text to be corrected can come from text typed by a user into a search box, text entered by a user on an intelligent question-answering page, raw text converted from speech, and so on. Raw text converted from speech is produced directly by existing speech-to-text technology, and because of the limitations of that technology the converted raw text may contain errors.
The domain knowledge base is a collection of professional vocabulary in a domain, for example an attribute knowledge base of commodities (listing information such as model series and attribute nouns of the commodities). The domain knowledge base may include general domain knowledge as well as long-tail data (long-tail data here refers to unique, very useful but rare vocabulary in a given domain), and so on.
The text error correction model combines the model framework of Neural Machine Translation (NMT) tasks with the self-attention Transformer framework, adopting an end-to-end text error correction model that incorporates domain knowledge. The model framework of the NMT task mainly adopts an encoder-decoder sequence model (the encoder is mainly responsible for encoding the source language to obtain the final feature vectors, and the decoder generates the target-language sequence, i.e. the corrected text, from those feature vectors). The application introduces the self-attention Transformer framework into the encoder-decoder sequence model to determine whether each character in the corrected correct text comes from copying or from generation, thereby improving text error correction efficiency.
In the embodiment shown in Fig. 1, the scheme provided by the application generates character pinyin for the text to be corrected, divides the text to be corrected into a plurality of character segments of preset length, and searches a preset domain knowledge base for domain entries matching each segment's pinyin; that is, character pinyin and domain entries are introduced for the text to be corrected. On the one hand this adds features to the text to be corrected, and on the other hand it narrows the range from which the text error correction model copies or generates the correct text, thereby effectively improving both the accuracy and the efficiency of text error correction.
A specific embodiment of step S102 may include: dividing the text to be corrected into a plurality of character segments of preset length, and searching a preset domain knowledge base for the domain entry of each segment according to the pinyin corresponding to that segment. The retrieved domain entry is the domain entry associated with the character segment. The preset length may be 2 characters, 3 characters, and so on, and can be set according to user requirements. For example, the text to be corrected "is this washing machine a west-rainbow all-in-one" is divided into character segments of length 3, such as "washing machine" (xi yi ji) and "west-rainbow one" (xi hong yi), with conventional meaningless words such as "is" filtered out. The pinyin "xi hong yi" of the segment "west-rainbow one" matches the domain entry "wash-and-dry all-in-one" in the domain knowledge base. That is, this process divides the text to be corrected into l character segments, yielding a domain entry set:
K = {k_1, k_2, …, k_l}.
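The segmentation and pinyin-keyed lookup of step S102 can be sketched as follows. The pinyin table, stop-word list and knowledge base below are toy assumptions for this single example, not the patent's actual data.

```python
# Sketch of step S102: divide the text to be corrected into fixed-length
# character segments and retrieve domain entries keyed by each segment's
# pinyin. The pinyin table, stop-word list and knowledge base are toy
# assumptions for illustration only.
PINYIN = {"这": "zhe", "款": "kuan", "洗": "xi", "衣": "yi", "机": "ji",
          "是": "shi", "西": "xi", "虹": "hong", "一": "yi", "体": "ti",
          "的": "de", "吗": "ma"}
STOPWORDS = {"这", "款", "是", "的", "吗"}
# pinyin key -> domain entry; "xi hong yi" matches 洗烘一体 (wash-and-dry all-in-one)
KNOWLEDGE_BASE = {"xi yi ji": "洗衣机", "xi hong yi": "洗烘一体"}

def segments(text: str, n: int = 3) -> list[str]:
    """All length-n windows that contain at least one non-stop-word character."""
    wins = [text[i:i + n] for i in range(len(text) - n + 1)]
    return [w for w in wins if not all(ch in STOPWORDS for ch in w)]

def lookup_entries(text: str) -> list[str]:
    """Domain entries whose pinyin matches some character segment of text."""
    found = []
    for seg in segments(text):
        key = " ".join(PINYIN.get(ch, "?") for ch in seg)
        if key in KNOWLEDGE_BASE:
            found.append(KNOWLEDGE_BASE[key])
    return found

print(lookup_entries("这款洗衣机是西虹一体的吗"))  # ['洗衣机', '洗烘一体']
```

The key design point is that the knowledge base is indexed by pinyin rather than by surface characters, so the misspelled segment 西虹一 still retrieves the correct entry 洗烘一体.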
The main architecture of text error correction by the text error correction model is shown in Fig. 2. As can be seen from Fig. 2, in the scheme provided by the embodiment of the present invention, the text to be corrected (for example "is this washing machine a west-rainbow all-in-one"), its character pinyin (for example "zhe kuan xi yi ji shi xi hong yi ti de ma") and the related domain entries (for example "wash-and-dry all-in-one") are encoded by the encoder. Domain knowledge is fused into the encoding of the original text by means of a cross-attention mechanism, and the copy probability of each character in the text to be corrected is calculated. The decoder calculates the generation probability of each character; the copy probability and the generation probability are combined to obtain the output probability of each character, and the corrected correct text is output according to the output probabilities.
For the main flow of the scheme shown in Fig. 2, the various embodiments of the invention respectively refine the encoding process, the decoding process, the calculation of the copy probability, the calculation of the generation probability, the calculation of the output probability distribution, and so on.
In an embodiment of the present invention, the text error correction method may further include: converting the text to be corrected, the character pinyin and the domain entry into corresponding vector representations respectively. Accordingly, as shown in Fig. 3, a specific embodiment of correcting the text to be corrected may include the following steps:
Step S301: inputting the vector representation of the text to be corrected, the vector representation of the character pinyin and the vector representation of the domain entry into the text error correction model;
The vector representations of the text to be corrected, of the character pinyin and of the domain entries can be obtained with existing text-to-vector techniques, for example by directly converting the text to be corrected and its character pinyin into a set of vector representations {h_1^x, h_2^x, …, h_g^x}, and by converting each domain entry of K = {k_1, k_2, …, k_l} into a vector representation to obtain the vector representation set of the domain entries.
Step S302: calculating, by the text error correction model, the output probability distribution of the characters based on the vector representation of the text to be corrected, the vector representation of the character pinyin and the vector representation of the domain entry;
The calculation of the character output probability distribution introduces the vector representations of the character pinyin and of the domain entries, i.e. adds features of the text to be corrected, so the accuracy of the calculated output probabilities can be effectively improved.
Step S303: and determining the characters included in the correct text according to the output probability distribution of the characters.
In an embodiment of the present invention, as shown in fig. 4, the method for text error correction may further include the following steps:
Step S401: encoding the vector representation of the text to be corrected, the vector representation of the character pinyin and the vector representation of the domain entry respectively with an encoder;
The specific embodiment of step S401 may include: merging the vector representation of the domain entry into the vector representation of the text to be corrected and the vector representation of the character pinyin; and encoding the merged result.
The encoding process can use the following calculation formulas (1) and (2) to obtain the character vector representations fused with the domain entries (the scaled dot-product cross-attention form shown is the standard one consistent with the symbols defined below):

α_{ij} = softmax((W_q h_i^x) · (W_k h_{ij}^k) / √d)    (1)

h_i = h_i^x + Σ_j α_{ij} (W_v h_{ij}^k)    (2)

wherein α_{ij} denotes the attention weight after fusing the i-th character of the text to be corrected with the j-th character of the domain entry; softmax() denotes the softmax function; W_q, W_k and W_v denote parameter matrices trained with the encoder; d denotes the dimension of W_k h_{ij}^k; h_i^x denotes the vector representation of the i-th character in the text to be corrected; h_{ij}^k denotes the vector representation of the j-th character in the domain entry retrieved for the i-th character; and h_i denotes the knowledge-fused character representation.
That is, the result of calculation formula (2) is the output of the encoder (the encoding corresponding to the knowledge-fused character representations).
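Formulas (1) and (2) can be sketched in numpy as below. The tensor sizes, random parameter values and the residual form of (2) are illustrative assumptions, not the patent's exact configuration.

```python
import numpy as np

# Sketch of formulas (1)-(2): fuse the retrieved domain-entry character
# vectors into each source-character vector with scaled dot-product
# cross-attention. Sizes, random parameters and the residual form of (2)
# are illustrative assumptions.
rng = np.random.default_rng(0)
d = 8                                  # hidden dimension
g, m = 5, 4                            # source length, entry length per character
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

hx = rng.standard_normal((g, d))       # h_i^x: source-character vectors
hk = rng.standard_normal((g, m, d))    # h_ij^k: domain-entry character vectors

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# (1) attention weight of entry character j for source character i
alpha = softmax(np.einsum("id,ijd->ij", hx @ Wq, hk @ Wk) / np.sqrt(d))
# (2) knowledge-fused character representation h_i
h = hx + np.einsum("ij,ijd->id", alpha, hk @ Wv)
print(h.shape)  # (5, 8)
```

Each source character attends only to the characters of the domain entry retrieved for it, so the fusion stays local to the matched segment.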
Step S402: inputting the encoded results into the decoder included in the text error correction model;
Step S403: calculating, by the decoder, the output probability distribution of the characters according to the encoded results.
Because the domain entries are introduced during encoding, the decoder can calculate the character output probability distribution more accurately during decoding.
In an embodiment of the present invention, as shown in fig. 5, the method for text error correction may further include the following steps:
Step S501: determining an confusion set of each character in the text to be corrected, wherein the confusion set comprises a plurality of approximate characters;
The specific implementation manner of this step S501 may be: the confusion sets for each character are stored in advance in the database as shown in table 1 below. And searching the confusion set of each character in the text to be corrected by a searching mode.
TABLE 1

Character            Confusion set
升 (sheng, "rise")   生 甥 声 … 省
文 (wen, "text")     紊 蚊 闻 纹 …
心 (xin, "heart")    新 欣 锌 芯 … 衅
Another specific implementation of step S501 may be: searching for similar characters, near-homophone characters and near-form characters for each character in the text to be corrected, and combining the found characters into the confusion set of the corresponding character.
Step S502: based on the confusion set, a step of calculating an output probability of a character in the text to be corrected is performed.
As shown in fig. 6, the process of calculating the output probability of the character in the text to be corrected in the step S502 may include the following steps:
Step S601: calculating the copy probability of each character included in the text to be corrected based on the text to be corrected and the domain entry;
The calculation in this step follows calculation formula (3):

P_t^{copy}(i) = softmax_i((W'_q s_t) · (W'_k h_i))    (3)

wherein P_t^{copy}(i) denotes the copy probability of the i-th character of the text to be corrected at decoding time t (the copy probability is the probability of copying from the source, i.e. the text to be corrected and the domain entries); s_t denotes the hidden state of the decoder at time t; W'_q and W'_k denote parameter matrices obtained by training; and h_i denotes the knowledge-fused encoded representation of the original input text (the text to be corrected).
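A minimal numpy sketch of the copy attention in formula (3) follows; the shapes and random parameter values are illustrative assumptions.

```python
import numpy as np

# Sketch of formula (3): at decoding time t, the copy distribution over
# the g source characters is attention of the decoder state s_t over the
# knowledge-fused encodings h_i. Shapes and parameters are illustrative.
rng = np.random.default_rng(1)
d, g = 8, 5
Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((d, d))
s_t = rng.standard_normal(d)           # decoder hidden state at time t
h = rng.standard_normal((g, d))        # knowledge-fused source encodings

scores = (h @ Wk) @ (Wq @ s_t) / np.sqrt(d)
scores -= scores.max()                 # numerical stability
p_copy = np.exp(scores) / np.exp(scores).sum()   # P_t^copy(i), sums to 1
print(p_copy.shape)  # (5,)
```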
Step S602: calculating the generation probability of the characters included in the vocabulary based on the confusion set corresponding to each character;
the vocabulary refers to a vocabulary formed by characters in the confusion set corresponding to each character.
The specific implementation manner of the step S602 may include: constructing an confusion set indication matrix for the text to be corrected according to the confusion set corresponding to each character; by the confusion set indication matrix, the output range in the generation mode can be limited to the confusion set, and the generation probability of each character in the word list is calculated.
The confusion set indication matrix M ∈ R^{g×|V|}, wherein |V| denotes the number of characters included in all confusion sets corresponding to the text to be corrected and g denotes the length of the text to be corrected; each element M_{if} of M takes the value 0 or 1, given by calculation formula (4):

M_{if} = 1 if the f-th vocabulary character belongs to the confusion set of the i-th character of the text to be corrected, and M_{if} = 0 otherwise.    (4)

By calculating the generation probability of each character with calculation formula (5) using the confusion set indication matrix M, the output range of the generation mode can be limited to the confusion set:

P_t^{conf} = P_t^{gen} ⊙ M_i    (5)

wherein P_t^{conf} denotes the generation probability over the confusion set at time t of the error correction decoding of the text to be corrected; P_t^{gen} denotes the reference generation probability at time t preset in the decoder; M_i denotes the new confusion indication vector formed by the elements associated with character i obtained from the confusion set indication matrix (i.e. the i-th row of the confusion set indication matrix); and ⊙ denotes the element-wise product.
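Formulas (4) and (5) can be sketched as below. The vocabulary, the confusion sets and the renormalisation after masking are assumptions for illustration.

```python
import numpy as np

# Sketch of formulas (4)-(5): build the confusion set indication matrix M
# and restrict the decoder's reference generation distribution to the
# confusion set of the aligned source character. The vocabulary, the
# confusion sets and the renormalisation after masking are assumptions.
vocab = ["升", "生", "声", "文", "闻", "新", "芯"]
confusion = {"升": {"升", "生", "声"}, "文": {"文", "闻"}}
src = ["升", "文"]                     # text to be corrected, g = 2

# (4) M[i, f] = 1 iff vocab[f] is in the confusion set of src[i]
M = np.array([[1.0 if w in confusion[c] else 0.0 for w in vocab] for c in src])

p_gen = np.full(len(vocab), 1 / len(vocab))  # toy reference distribution P_t^gen
i = 0                                        # decoding step aligned to src[0]
masked = p_gen * M[i]                        # (5) element-wise product with row M_i
p_conf = masked / masked.sum()               # renormalise over the confusion set
print(p_conf)
```

After masking, only the three characters in the confusion set of 升 keep non-zero probability, which is exactly the search-space reduction the patent describes.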
Step S603: calculating the output probability distribution of each character according to the generation probability of the characters included in the vocabulary and the copy probability of the characters included in the text to be corrected.
Here, the characters range over both the characters included in the vocabulary and the characters included in the text to be corrected.
In this step, the output probability can be calculated by calculation formula (6):

P_t(i) = β · P_t^{copy}(i) + (1 − β) · P_t^{conf}(i)    (6)

wherein P_t(i) denotes the output probability of the i-th character at time t of the error correction decoding process of the text to be corrected; β denotes the trained weight of the copy mode, serving during decoding as a balance factor between generating from the confusion vocabulary and copying from the text to be corrected; P_t^{conf}(i) denotes the result calculated by calculation formula (5); and P_t^{copy}(i) denotes the copy probability of the i-th character at time t, calculated by calculation formula (3). The copy probability is the probability of copying from the source (the text to be corrected and the domain entries).
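The mixture in formula (6) reduces to a weighted sum, as in this toy sketch; the value of β and both distributions (assumed already aligned to one candidate list of three characters) are illustrative assumptions.

```python
import numpy as np

# Sketch of formula (6): mix the copy distribution and the confusion-set
# generation distribution with a learned balance factor beta. The value
# of beta and both toy distributions (assumed aligned to one candidate
# list of three characters) are illustrative assumptions.
beta = 0.6                                   # trained copy-mode weight
p_copy = np.array([0.7, 0.2, 0.1])           # P_t^copy over the candidates
p_conf = np.array([0.5, 0.3, 0.2])           # P_t^conf over the same candidates

p_out = beta * p_copy + (1 - beta) * p_conf  # P_t(i)
print(p_out)  # [0.62 0.24 0.14]
```

Because both inputs are probability distributions and the weights β and 1 − β sum to one, P_t is itself a valid distribution with no extra normalisation.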
In an embodiment of the present invention, the text error correction method may further include: constructing a loss function using the output probabilities on each training sample, and training the model parameters by minimizing the value of the loss function to obtain the text error correction model.
The loss function constructed in the above step is given by calculation formula (7):

loss = −Σ_{t=1}^{T} log P_t(i′)    (7)

wherein loss denotes the loss value; P_t(i′) denotes the output probability at time t of the correct character i′ of the training text included in the training sample; and T denotes the total number of characters of the training text included in the training sample.
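The negative log-likelihood in formula (7) can be sketched with toy per-step distributions and gold indices (both assumed for illustration):

```python
import numpy as np

# Sketch of formula (7): the loss is the negative log-likelihood of the
# correct characters under the per-step output distributions P_t. The
# distributions and gold indices below are toy values.
P = np.array([                 # P_t over 4 candidate characters, T = 3 steps
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.6, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
])
gold = [0, 1, 3]               # index i' of the correct character at each step

loss = float(-sum(np.log(P[t, g]) for t, g in enumerate(gold)))
print(round(loss, 4))  # 1.2242
```

Minimising this quantity pushes each P_t toward putting its mass on the gold character, jointly training the copy and generation parameters.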
In summary, the scheme provided by the embodiments of the present invention fuses domain knowledge into the encoding process, improving the error detection and correction capability of the text error correction model. In addition, pinyin features of the characters to be corrected are added during encoding; meanwhile, during decoding, the confusion set constrains the generation range of the text error correction model, which reduces the search space and improves both prediction accuracy and computational efficiency.
As shown in fig. 7, an embodiment of the present invention provides a text error correction apparatus 700, where the text error correction apparatus 700 may include: a text processing module 701, a domain matching module 702, and a text error correction module 703, wherein,
The text processing module 701 is configured to obtain a text to be corrected, and generate a character pinyin for the text to be corrected;
The domain matching module 702 is configured to divide the text to be corrected into a plurality of character segments with preset lengths, and match domain entries for the character segments in a preset domain knowledge base according to the pinyin of the characters corresponding to the character segments;
The text correction module 703 is configured to input the text to be corrected, the character pinyin and the domain entry into a text correction model, correct the text to be corrected by using the text correction model, and output a corrected correct text; the text error correction model is obtained by training a training sample, the training sample comprises an error text and a correct text corresponding to the error text, and training information input for the text error correction model comprises: error text, pinyin for characters of the error text, and domain entries for the error text.
In the embodiment of the present invention, the field matching module 702 is configured to divide a text to be corrected into a plurality of character segments with preset lengths; and searching the domain entry for the character segment in a preset domain knowledge base according to the pinyin of the character corresponding to the character segment.
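The segment-and-match step performed by the domain matching module can be sketched as a sliding window over the text with a pinyin-keyed lookup. `pinyin_of` and the knowledge-base format here are illustrative assumptions; a real system would use a pinyin library and handle heteronyms:

```python
def match_domain_entries(text, pinyin_of, knowledge_base, seg_len=2):
    # Slide a window of the preset length over the text; join the pinyin of
    # the characters in each segment and look the key up in the domain
    # knowledge base (here a dict from pinyin string to domain entry).
    matches = []
    for start in range(len(text) - seg_len + 1):
        segment = text[start:start + seg_len]
        key = " ".join(pinyin_of[ch] for ch in segment)
        if key in knowledge_base:
            matches.append((segment, knowledge_base[key]))
    return matches

# "花被" (hua bei) matches the hypothetical domain entry "花呗" by pinyin even
# though the surface characters differ -- the homophone errors the method targets.
hits = match_domain_entries("花被", {"花": "hua", "被": "bei"}, {"hua bei": "花呗"})
```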
In the embodiment of the present invention, the text error correction module 703 is configured to convert the text to be corrected, the character pinyin, and the domain entries into corresponding vector representations respectively; input the vector representation of the text to be corrected, the vector representation of the character pinyin, and the vector representation of the domain entry into the text error correction model; the text error correction model calculates the output probability distribution of the characters based on the vector representation of the text to be corrected, the vector representation of the character pinyin, and the vector representation of the domain entry; and determines the characters included in the correct text according to the output probability distribution of the characters.
In the embodiment of the present invention, the text error correction module 703 is further configured to encode, with an encoder, a vector representation of a text to be error corrected, a vector representation of a pinyin of a character, and a vector representation of an entry of a domain, respectively; inputting the encoded result into a decoder included in the text error correction model; the decoder calculates the output probability of the character based on the result of the encoding.
In the embodiment of the present invention, the text correction module 703 is configured to integrate the vector representation of the term of the field into the vector representation of the text to be corrected and the vector representation of the pinyin of the character; and encoding the merged result.
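One plausible way to merge the domain-entry representation into the character and pinyin representations is to tile it across positions and concatenate. The patent does not fix the fusion operator, so this concatenation is an assumption:

```python
import numpy as np

def fuse_features(char_vecs, pinyin_vecs, entry_vec):
    # char_vecs, pinyin_vecs: (T, d) per-character embeddings;
    # entry_vec: (d_e,) a single pooled domain-entry vector.
    # Tile the entry vector to every position and concatenate along the
    # feature axis, giving the encoder a (T, 2*d + d_e) input.
    T = char_vecs.shape[0]
    entry_tiled = np.tile(entry_vec, (T, 1))
    return np.concatenate([char_vecs, pinyin_vecs, entry_tiled], axis=1)

fused = fuse_features(np.zeros((4, 8)), np.zeros((4, 8)), np.ones(6))
```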
In an embodiment of the present invention, the text error correction module 703 is configured to determine a confusion set for each character in the text to be corrected, where the confusion set includes a plurality of approximate characters; the decoder performs the step of calculating the output probability distribution of the characters based on the confusion set.
In the embodiment of the present invention, the text error correction module 703 is configured to calculate the copy probability of each character included in the text to be corrected, based on the text to be corrected and the domain entries retrieved for it; calculate the generation probability of each character in the vocabulary based on the confusion set corresponding to each character; and calculate the output probability distribution of each character from the generation probability of each character in the vocabulary and the copy probability of each character in the text to be corrected.
In the embodiment of the present invention, the text error correction module 703 is further configured to construct a confusion set indication matrix for the text to be corrected according to the confusion set corresponding to each character; the step of calculating the generation probability of each character in the vocabulary is performed with the output range of the generation mode limited to the confusion set by the confusion set indication matrix.
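Restricting the generation mode to the confusion set can be sketched as a masked softmax, where the indication matrix zeroes out every vocabulary entry outside a position's confusion set. A sketch under the assumption that every confusion set is non-empty:

```python
import numpy as np

def masked_generation_probs(logits, confusion_mask):
    # logits: (T, V) unnormalized scores over the vocabulary.
    # confusion_mask: (T, V) 0/1 indication matrix; 1 marks vocabulary
    # entries inside position t's confusion set.
    masked = np.where(confusion_mask.astype(bool), logits, -np.inf)
    masked = masked - masked.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(masked)  # exp(-inf) == 0 outside the confusion set
    return exp / exp.sum(axis=1, keepdims=True)

# With equal logits, probability spreads evenly over the allowed characters only.
probs = masked_generation_probs(np.array([[1.0, 1.0, 1.0]]),
                                np.array([[1, 0, 1]]))
```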
In the embodiment of the present invention, the text error correction module 703 is further configured to construct a loss function by using the output probability of each training sample; model parameters are trained by minimizing the value of the loss function to obtain a text error correction model.
The text error correction apparatus described above may be installed on a client as a plug-in, or may be deployed on a server that communicates with the client.
Fig. 8 illustrates an exemplary system architecture 800 of a method of text correction or an apparatus of text correction to which embodiments of the present invention may be applied.
As shown in fig. 8, a system architecture 800 may include terminal devices 801, 802, 803, a network 804, and a server 805. The network 804 serves as a medium for providing communication links between the terminal devices 801, 802, 803 and the server 805. The network 804 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
A user may interact with the server 805 through the network 804 using the terminal devices 801, 802, 803 to receive or send messages or the like. Various communication client applications such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only) may be installed on the terminal devices 801, 802, 803.
The terminal devices 801, 802, 803 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 805 may be a server providing various services. For example, the server 805 may encapsulate the trained text error correction model into a text error correction apparatus, or package it as a plug-in, and publish the apparatus or plug-in via the network 804 to the communication client applications installed on the terminal devices 801, 802, 803. Alternatively, the server 805 may encapsulate the trained text error correction model into the text error correction apparatus and run the apparatus itself.
In the case where the server 805 packages the trained text error correction model into a text error correction apparatus or plug-in and publishes it via the network 804 to the communication client applications installed on the terminal devices 801, 802, 803, a communication client application on a terminal device 801, 802, 803 that receives externally input text takes that text as the text to be corrected and performs error correction on it through the apparatus or plug-in.
In the case where the server 805 encapsulates the trained text error correction model into a text error correction apparatus and runs the apparatus itself, the server obtains text input by a user through a communication client application on the terminal devices 801, 802, 803, takes that text as the text to be corrected, performs error correction on it through the apparatus, and outputs the corrected correct text to the communication client application on the terminal devices 801, 802, 803.
It should be noted that, the method for text error correction provided in the embodiment of the present invention may be executed by the terminal devices 801, 802, 803 or the server 805, and accordingly, the apparatus for text error correction may be set in the terminal devices 801, 802, 803 or the server 805.
It should be understood that the number of terminal devices, networks and servers in fig. 8 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 9, there is illustrated a schematic diagram of a computer system 900 suitable for use in implementing an embodiment of the present invention. The terminal device shown in fig. 9 is only an example, and should not impose any limitation on the functions and the scope of use of the embodiment of the present invention.
As shown in fig. 9, the computer system 900 includes a Central Processing Unit (CPU) 901, which can execute various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 902 or a program loaded from a storage section 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data necessary for the operation of the system 900 are also stored. The CPU 901, ROM 902, and RAM 903 are connected to each other through a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
The following components are connected to the I/O interface 905: an input section 906 including a keyboard, a mouse, and the like; an output portion 907 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage portion 908 including a hard disk or the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the internet. The drive 910 is also connected to the I/O interface 905 as needed. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed as needed on the drive 910 so that a computer program read out therefrom is installed into the storage section 908 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from the network via the communication portion 909 and/or installed from the removable medium 911. The above-described functions defined in the system of the present invention are performed when the computer program is executed by a Central Processing Unit (CPU) 901.
The computer readable medium shown in the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules involved in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor, for example, described as: a processor includes a text processing module, a domain matching module, and a text error correction module. The names of these modules do not limit the modules themselves in any way; for example, the text processing module may also be described as "a module that obtains the text to be corrected and generates character pinyin for the text to be corrected".
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to: acquire a text to be corrected, and generate character pinyin for the text to be corrected; search a preset domain knowledge base for domain entries for the text to be corrected; input the text to be corrected, the character pinyin, and the domain entries into a text error correction model, wherein the text error correction model is trained with training samples, each training sample comprises an error text and the correct text corresponding to it, and the training information input to the text error correction model comprises: the error text, the pinyin of the characters of the error text, and the domain entries of the error text; and correct the text to be corrected by using the text error correction model, and output the corrected correct text.
According to the technical scheme provided by the embodiments of the present application, character pinyin is generated for the text to be corrected, the text is divided into a plurality of character segments of preset length, and domain entries are matched for the segments in a preset domain knowledge base according to the pinyin of the corresponding characters. Introducing the character pinyin and the domain entries enriches the features of the text to be corrected and narrows the range from which the text error correction model copies or generates the correct text, which effectively improves both the accuracy and the efficiency of text error correction.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (11)

1. A method of text correction, comprising:
Acquiring a text to be corrected, and generating character pinyin for the text to be corrected;
searching a domain entry for the text to be corrected in a preset domain knowledge base;
Inputting the text to be corrected, the character pinyin and the field entry into a text correction model, wherein the text correction model is trained by a training sample, the training sample comprises an error text and a correct text corresponding to the error text, and training information input for the text correction model comprises: the error text, the pinyin of the characters of the error text and the domain entry of the error text;
correcting the text to be corrected by using the text correction model, and outputting corrected correct text;
and matching the field entry for the text to be corrected, including:
dividing the text to be corrected into a plurality of character fragments with preset lengths;
and searching the domain vocabulary entry for the character segment in a preset domain knowledge base according to the character pinyin corresponding to the character segment, wherein the domain knowledge base refers to a set containing a series of specialized vocabularies of the domain.
2. The method as recited in claim 1, further comprising:
respectively converting the text to be corrected, the character pinyin and the field entry into corresponding vector representations;
correcting the text to be corrected, including:
Inputting the vector representation of the text to be corrected, the vector representation of the character pinyin and the vector representation of the field entry into the text correction model;
the text error correction model calculates the output probability distribution of characters based on the vector representation of the text to be corrected, the vector representation of the pinyin of the characters and the vector representation of the domain entry;
and determining the characters included in the correct text according to the output probability distribution of the characters.
3. The method of claim 2, wherein calculating the output probability distribution of the character comprises:
Encoding the vector representation of the text to be corrected, the vector representation of the pinyin of the character and the vector representation of the field entry by using an encoder;
inputting the encoded result into a decoder included in the text error correction model;
the decoder calculates an output probability distribution of the character based on the result of the encoding.
4. A method according to claim 3, wherein encoding the vector representation of the text to be corrected, the vector representation of the pinyin for the character, and the vector representation of the domain entry, respectively, with the encoder comprises:
merging the vector representation of the domain entry into the vector representation of the text to be corrected and the vector representation of the character pinyin;
And encoding the merged result.
5. A method as claimed in claim 3, further comprising:
Determining a confusion set of each character in the text to be corrected, wherein the confusion set comprises a plurality of approximate characters;
based on the confusion set, the decoder performs the step of calculating an output probability distribution for the character.
6. The method of claim 5, wherein calculating the output probability distribution of the character comprises:
calculating the duplication probability of each character included in the text to be corrected based on the text to be corrected and the field entry retrieved for the text to be corrected;
calculating the generation probability of the characters included in the word list based on the confusion set corresponding to each character;
and calculating the output probability of each character according to the generation probability of the character included in the vocabulary and the replication probability of each character included in the text to be corrected.
7. The method as recited in claim 6, further comprising:
Constructing a confusion set indication matrix for the text to be corrected according to the confusion set corresponding to each character;
limiting the output range in the generation mode to be within the confusion set through the confusion set indication matrix, and executing the step of calculating the generation probability of each character included in the word list.
8. The method as recited in claim 5, further comprising:
constructing a loss function by using the output probability of each training sample;
model parameters are trained by minimizing the value of the loss function to obtain the text error correction model.
9. An apparatus for text correction, comprising: a text processing module, a field matching module and a text error correction module, wherein,
The text processing module is used for acquiring the text to be corrected and generating character pinyin for the text to be corrected;
The domain matching module is used for dividing the text to be corrected into a plurality of character segments with preset lengths, and matching domain entries for the character segments in a preset domain knowledge base according to the pinyin of the characters corresponding to the character segments;
The text correction module is used for inputting the text to be corrected, the character pinyin and the field entry into a text correction model, correcting the text to be corrected by using the text correction model, and outputting corrected correct text; the text error correction model is obtained by training a training sample, the training sample comprises an error text and a correct text corresponding to the error text, and training information input for the text error correction model comprises: the error text, the pinyin of the characters of the error text and the domain entry of the error text;
the field matching module is used for dividing the text to be corrected into a plurality of character fragments with preset lengths; and searching the domain vocabulary entry for the character segment in a preset domain knowledge base according to the character pinyin corresponding to the character segment, wherein the domain knowledge base refers to a set containing a series of specialized vocabularies of the domain.
10. An electronic device, comprising:
One or more processors;
storage means for storing one or more programs,
When executed by the one or more processors, causes the one or more processors to implement the method of any of claims 1-8.
11. A computer readable medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any of claims 1-8.
CN202110279919.9A 2021-03-16 2021-03-16 Text error correction method and device Active CN113051894B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110279919.9A CN113051894B (en) 2021-03-16 2021-03-16 Text error correction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110279919.9A CN113051894B (en) 2021-03-16 2021-03-16 Text error correction method and device

Publications (2)

Publication Number Publication Date
CN113051894A CN113051894A (en) 2021-06-29
CN113051894B true CN113051894B (en) 2024-07-16

Family

ID=76512806

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110279919.9A Active CN113051894B (en) 2021-03-16 2021-03-16 Text error correction method and device

Country Status (1)

Country Link
CN (1) CN113051894B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114239559B (en) * 2021-11-15 2023-07-11 北京百度网讯科技有限公司 Text error correction and text error correction model generation method, device, equipment and medium
CN116757184B (en) * 2023-08-18 2023-10-20 昆明理工大学 Vietnam voice recognition text error correction method and system integrating pronunciation characteristics
CN117787266B (en) * 2023-12-26 2024-07-26 人民网股份有限公司 Large language model text error correction method and device based on pre-training knowledge embedding

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111428494A (en) * 2020-03-11 2020-07-17 中国平安人寿保险股份有限公司 Intelligent error correction method, device and equipment for proper nouns and storage medium
CN111523306A (en) * 2019-01-17 2020-08-11 阿里巴巴集团控股有限公司 Text error correction method, device and system

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103678271B (en) * 2012-09-10 2016-09-14 华为技术有限公司 A kind of text correction method and subscriber equipment
US20140214401A1 (en) * 2013-01-29 2014-07-31 Tencent Technology (Shenzhen) Company Limited Method and device for error correction model training and text error correction
US10354009B2 (en) * 2016-08-24 2019-07-16 Microsoft Technology Licensing, Llc Characteristic-pattern analysis of text
CN107741928B (en) * 2017-10-13 2021-01-26 四川长虹电器股份有限公司 Method for correcting error of text after voice recognition based on domain recognition
CN109753636A (en) * 2017-11-01 2019-05-14 阿里巴巴集团控股有限公司 Machine processing and text error correction method and device calculate equipment and storage medium
CN109492202B (en) * 2018-11-12 2022-12-27 浙江大学山东工业技术研究院 Chinese error correction method based on pinyin coding and decoding model
CN111626048A (en) * 2020-05-22 2020-09-04 腾讯科技(深圳)有限公司 Text error correction method, device, equipment and storage medium
CN111695342B (en) * 2020-06-12 2023-04-25 复旦大学 Text content correction method based on context information
CN112287670A (en) * 2020-11-18 2021-01-29 北京明略软件系统有限公司 Text error correction method, system, computer device and readable storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111523306A (en) * 2019-01-17 2020-08-11 阿里巴巴集团控股有限公司 Text error correction method, device and system
CN111428494A (en) * 2020-03-11 2020-07-17 中国平安人寿保险股份有限公司 Intelligent error correction method, device and equipment for proper nouns and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Chinese Grammatical Error Correction Method Based on Data Augmentation and Copying; Wang Quanbin, Tan Ying; CAAI Transactions on Intelligent Systems; Vol. 15, No. 1; pp. 99-105 *

Also Published As

Publication number Publication date
CN113051894A (en) 2021-06-29

Similar Documents

Publication Publication Date Title
KR102401942B1 (en) Method and apparatus for evaluating translation quality
CN113051894B (en) Text error correction method and device
CN109376234B (en) Method and device for training abstract generation model
JP7301922B2 (en) Semantic retrieval method, device, electronic device, storage medium and computer program
US11899699B2 (en) Keyword generating method, apparatus, device and storage medium
JP7335300B2 (en) Knowledge pre-trained model training method, apparatus and electronic equipment
CN111611452B (en) Method, system, equipment and storage medium for identifying ambiguity of search text
CN114861889B (en) Deep learning model training method, target object detection method and device
JP2021033995A (en) Text processing apparatus, method, device, and computer-readable storage medium
CN113299282B (en) Voice recognition method, device, equipment and storage medium
CN111488742B (en) Method and device for translation
WO2024045475A1 (en) Speech recognition method and apparatus, and device and medium
WO2024146328A1 (en) Training method for translation model, translation method, and device
US20180137098A1 (en) Methods and systems for providing universal portability in machine learning
CN114743012B (en) Text recognition method and device
KR102608867B1 (en) Method for industry text increment, apparatus thereof, and computer program stored in medium
CN112542154B (en) Text conversion method, text conversion device, computer readable storage medium and electronic equipment
CN110852057A (en) Method and device for calculating text similarity
US20230153550A1 (en) Machine Translation Method and Apparatus, Device and Storage Medium
CN115860003A (en) Semantic role analysis method and device, electronic equipment and storage medium
CN112883711B (en) Method and device for generating abstract and electronic equipment
CN115357710A (en) Training method and device for table description text generation model and electronic equipment
CN112560466A (en) Link entity association method and device, electronic equipment and storage medium
JP2017059216A (en) Query calibration system and method
CN110162767B (en) Text error correction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 221, 2 / F, block C, 18 Kechuang 11th Street, Beijing Economic and Technological Development Zone, 100176

Applicant after: Jingdong Technology Holding Co.,Ltd.

Address before: Room 221, 2 / F, block C, 18 Kechuang 11th Street, Beijing Economic and Technological Development Zone, 100176

Applicant before: Jingdong Digital Technology Holding Co.,Ltd.

GR01 Patent grant