CN113051894A - Text error correction method and device - Google Patents

Text error correction method and device

Info

Publication number
CN113051894A
CN113051894A (application number CN202110279919.9A)
Authority
CN
China
Prior art keywords
text
corrected
character
vector representation
error
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110279919.9A
Other languages
Chinese (zh)
Inventor
王培英
陈蒙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JD Digital Technology Holdings Co Ltd
Original Assignee
JD Digital Technology Holdings Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by JD Digital Technology Holdings Co Ltd filed Critical JD Digital Technology Holdings Co Ltd
Priority to CN202110279919.9A priority Critical patent/CN113051894A/en
Publication of CN113051894A publication Critical patent/CN113051894A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/232Orthographic correction, e.g. spell checking or vowelisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention discloses a text error correction method and device, relating to the field of computer technology. An embodiment of the method comprises the following steps: acquiring a text to be corrected and generating character pinyin for it; retrieving domain entries for the text to be corrected from a preset domain knowledge base; and inputting the text to be corrected, the character pinyin and the domain entries into a text correction model. The model is obtained by training on samples, each comprising an erroneous text and its corresponding correct text, with the training information input to the model comprising the erroneous text, the character pinyin of the erroneous text, and the domain entries of the erroneous text. The model then corrects the text to be corrected. This embodiment can improve both the accuracy and the efficiency of text error correction.

Description

Text error correction method and device
Technical Field
The invention relates to the technical field of computers, in particular to a text error correction method and device.
Background
Text error correction (i.e., the process of correcting errors in a text) is involved in many application scenarios, such as retrieval, text conversion, intention recognition, and intelligent customer service. Correcting errors early allows downstream processing such as lexical analysis and intention recognition to be performed accurately, so from the perspective of the natural language processing pipeline as a whole, text error correction plays a safeguarding role.
Currently, text error correction generally relies on a manually constructed dictionary of wrongly written characters for error matching and correction.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:
due to the limitations of such a dictionary, some rare proper nouns and similar terms may not be included in it, resulting in low accuracy and efficiency of text error correction.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for text error correction, which can effectively improve accuracy and efficiency of text error correction.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a text error correction method, including:
acquiring a text to be corrected, and generating character pinyin for the text to be corrected;
retrieving a domain entry for the text to be corrected in a preset domain knowledge base;
inputting the text to be corrected, the character pinyin and the field entries into a text correction model, wherein the text correction model is obtained by training a training sample, the training sample comprises an error text and a correct text corresponding to the error text, and the training information input for the text correction model comprises: the error text, the character pinyin of the error text and the domain entries of the error text;
and correcting the text to be corrected by using the text correction model, and outputting the corrected correct text.
Preferably, retrieving the domain entries for the text to be corrected includes:
dividing the text to be corrected into a plurality of character segments with preset lengths;
and retrieving a domain entry for the character segment in a preset domain knowledge base according to the character pinyin corresponding to the character segment.
Preferably, the text error correction method further includes:
respectively converting the text to be corrected, the character pinyin and the field entries into corresponding vector representations;
correcting the text to be corrected, including:
inputting the vector representation of the text to be corrected, the vector representation of the character pinyin and the vector representation of the field entry into the text correction model;
the text error correction model calculates the output probability distribution of the characters based on the vector representation of the text to be corrected, the vector representation of the character pinyin and the vector representation of the field entry;
and determining the characters included in the correct text according to the output probability distribution of the characters.
Preferably, calculating an output probability distribution of the character comprises:
respectively encoding the vector representation of the text to be corrected, the vector representation of the character pinyin and the vector representation of the field entry by using the encoder;
inputting the result of the encoding into a decoder comprised by the text correction model;
and the decoder calculates the output probability distribution of the characters according to the encoding result.
Preferably, the encoding, by the encoder, the vector representation of the text to be corrected, the vector representation of the character pinyin, and the vector representation of the field entry respectively includes:
the vector representation of the field entry is blended into the vector representation of the text to be corrected and the vector representation of the character pinyin;
and encoding the merged result.
Preferably, the text error correction method further includes:
determining a confusion set of each character in the text to be corrected, wherein the confusion set comprises a plurality of approximate characters;
based on the confusion set, the decoder performs the step of calculating the output probability distribution of the characters.
Preferably, calculating an output probability distribution of the character comprises:
calculating the copy probability of each character included in the text to be corrected based on the text to be corrected and the field entry;
calculating the generation probability of the characters included in the word list based on the confusion set corresponding to each character;
and calculating the output probability distribution of each character according to the generation probability of the character included in the word list and the copying probability of each character included in the text to be corrected.
Preferably, the text error correction method further includes:
constructing a confusion set indication matrix for the text to be corrected according to the confusion set corresponding to each character;
the output range in the generating mode can be limited in the confusion set through the confusion set indication matrix, and the step of calculating the generating probability of each character included in the word list is executed.
Preferably, the text error correction method further includes:
constructing a loss function by using the output probability of each training sample;
training model parameters by minimizing the value of the loss function to obtain the text error correction model.
In a second aspect, an embodiment of the present invention provides an apparatus for correcting text, including: a text processing module, a domain matching module, and a text correction module, wherein,
the text processing module is used for acquiring a text to be corrected and generating character pinyin for the text to be corrected;
the domain matching module is used for dividing the text to be corrected into a plurality of character segments with preset lengths, and retrieving domain entries for the character segments in a preset domain knowledge base according to character pinyin corresponding to the character segments;
the text error correction module is used for inputting the text to be corrected, the character pinyin and the field entries into a text error correction model, correcting the text to be corrected by using the text error correction model and outputting a corrected text; the text correction model is obtained by training a training sample, the training sample comprises an error text and a correct text corresponding to the error text, and the training information input for the text correction model comprises: the wrong text, the character pinyin of the wrong text and the domain entries of the wrong text.
One embodiment of the above invention has the following advantages or benefits: in the scheme provided by the application, character pinyin is generated for the text to be corrected, the text is divided into a plurality of character segments of a preset length, and domain entries are retrieved for the segments from a preset domain knowledge base according to the character pinyin corresponding to each segment; that is, character pinyin and domain entries are introduced for the text to be corrected. Introducing them adds features to the text to be corrected and narrows the range from which the text correction model copies or generates the correct text, thereby effectively improving both the accuracy and the efficiency of text correction.
Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of a main flow of a method of text error correction according to an embodiment of the invention;
FIG. 2 is a schematic diagram of the main structure of text correction according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a main flow of error correction of a text to be corrected according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a main flow of computing output probabilities of characters of a fusion domain entry according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a main flow of computing output probabilities of characters of a fusion domain entry based on a confusion set, according to an embodiment of the invention;
FIG. 6 is a schematic diagram of a main flow of computing output probabilities of characters of a fusion domain entry based on a confusion set, according to another embodiment of the present invention;
FIG. 7 is a schematic diagram of the main blocks of an apparatus for text correction according to an embodiment of the present invention;
FIG. 8 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
fig. 9 is a schematic structural diagram of a computer system suitable for implementing a terminal device or a server according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 shows a method for text error correction according to an embodiment of the present invention. As shown in fig. 1, the method may include the following steps:
step S101: acquiring a text to be corrected, and generating character pinyin for the text to be corrected;
because characters in the text usually have a plurality of homophones, the error correction direction of the text to be corrected can be better expanded through the character pinyin. For example, the text to be corrected is "the washing machine is a rainbow unit", and "zhe kuan xi ji shi xi hong yi ti de ma" can be generated through the step. Because the pinyin can correspond to various homophones, the character pinyin is introduced through the step, and more error correction features can be provided for the later.
Step S102: retrieving a domain entry for the text to be corrected in a preset domain knowledge base;
step S103: inputting a text to be corrected, character pinyin and field entries into a text correction model, wherein the text correction model is obtained by training a training sample, the training sample comprises an error text and a correct text corresponding to the error text, and training information input for the text correction model comprises: the method comprises the following steps of (1) obtaining a wrong text, a character pinyin of the wrong text and a field entry of the wrong text;
step S104: and correcting the text to be corrected by using the text correction model, and outputting the corrected correct text.
The text to be corrected may come from information entered by a user in a search box, information entered by a user on an intelligent question-answering page, primary text converted from speech, and the like. The primary text converted from speech is the text produced directly by existing speech-to-text technology; owing to the limitations of that technology, the converted primary text may contain character errors.
A domain knowledge base is a collection of specialist terms in a domain, such as a commodity attribute knowledge base (which lists information such as product model series and attribute nouns). The domain knowledge base may include conventional domain knowledge as well as niche, long-tail data (long-tail data being words that are particular to a certain domain yet uncommon and rarely seen), etc.
The text error correction model combines the model framework of the Neural Machine Translation (NMT) task with the Transformer architecture based on the self-attention mechanism, adopting an end-to-end text error correction model that incorporates domain knowledge. The NMT model framework mainly adopts an encoder-decoder sequence model, in which the encoder encodes the source language to obtain final feature vectors, and the decoder generates the target-language sequence, i.e., the corrected text, from that feature information. The present application improves text error correction efficiency by introducing the self-attention Transformer architecture into the encoder-decoder sequence model to determine whether each character of the corrected text comes from copying or from generation.
In the embodiment shown in fig. 1, character pinyin is generated for the text to be corrected, the text is divided into a plurality of character segments of a preset length, and domain entries are matched for the segments in a preset domain knowledge base according to the segments' character pinyin; that is, character pinyin and domain entries are introduced for the text to be corrected. On the one hand this adds features to the text to be corrected, and on the other it narrows the range from which the text correction model copies or generates the correct text, thereby effectively improving both the accuracy and the efficiency of text correction.
The specific implementation of step S102 may include: dividing the text to be corrected into a plurality of character segments of a preset length, and retrieving domain entries for the character segments from the preset domain knowledge base according to the character pinyin corresponding to each segment. A retrieved domain entry is one associated with the character segment. The preset length may be 2 characters or 3 characters, for example, and can be set according to user requirements. For example, the text to be corrected, "is this washing machine rainbow-integrated", is divided into character segments of length 3 such as "washing machine" (xi yi ji) and "rainbow one" (xi hong yi), while conventional function words in the text, such as "is" and the question particle, are filtered out. For the pinyin "xi hong yi" of the segment "rainbow one", the domain entry "washing and drying integrated" is matched in the domain knowledge base. That is, this process divides the text to be corrected into l character segments, and accordingly a domain entry set K = {k1, k2, …, kl} is obtained.
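The segmenting-and-retrieval step above can be sketched as follows; the PINYIN table and the KB knowledge base are hypothetical stand-ins for a real pinyin tool and a real domain knowledge base:

```python
# Sketch of step S102: split the text into fixed-length character
# segments (a sliding window) and look up domain entries by the
# segments' pinyin. PINYIN and KB are hypothetical stand-ins.
PINYIN = {"这": "zhe", "款": "kuan", "洗": "xi", "衣": "yi", "机": "ji",
          "是": "shi", "红": "hong", "一": "yi", "体": "ti", "的": "de", "吗": "ma"}
KB = {("xi", "hong", "yi"): "洗烘一体"}  # pinyin key -> domain entry (wash-dry integrated)

def char_segments(text: str, n: int = 3) -> list[str]:
    """All contiguous character segments of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def retrieve_domain_entries(text: str, n: int = 3) -> list[str]:
    entries = []
    for seg in char_segments(text, n):
        key = tuple(PINYIN.get(ch, ch) for ch in seg)
        if key in KB:
            entries.append(KB[key])
    return entries

print(retrieve_domain_entries("这款洗衣机是洗红一体的吗"))  # -> ['洗烘一体']
```

Keying the knowledge base by pinyin rather than by surface characters is what lets the erroneous segment still hit the correct entry.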
the main architecture of text correction by the text correction model can be as shown in fig. 2. As can be seen from fig. 2, in the solution provided by the embodiment of the present invention, the encoder encodes the text to be corrected (for example, "this washing machine is a rainbow-integrated machine"), the pinyin of the text to be corrected (for example, "zhe kuan xi yi ji shi xi hong yi ti ma") and the related domain entry (for example, "washing and drying integrated machine"), and the domain knowledge is merged into the code of the original text by a cross attention mechanism, the copying probability of each character in the text to be corrected is calculated (wherein cross attention means that the domain knowledge is merged into the code of the original text), and calculating the generation probability of each character in the text to be corrected through a decoder, obtaining the output probability of each character by using the copy probability and the generation probability of each character, and outputting the corrected correct text according to the output probability.
For the main process of the scheme provided by the embodiment of the present invention shown in fig. 2, the embodiments of the present invention respectively improve the processes of encoding, decoding, calculating the duplication probability, calculating the generation probability, calculating the output probability distribution, and the like.
In an embodiment of the present invention, the text error correction method may further include: respectively converting the text to be corrected, the character pinyin and the field entries into corresponding vector representations; accordingly, as shown in fig. 3, a specific embodiment of correcting the text to be corrected may include the following steps:
step S301: inputting the vector representation of the text to be corrected, the vector representation of the character pinyin and the vector representation of the field entry into a text correction model;
the vector representation of the text to be corrected, the vector representation of the character pinyin and the vector representation of the field entry can be realized by the conventional text vector conversion mode, for example, the text to be corrected and the character pinyin are converted into a vector representation set directly through an encoder
Figure BDA0002978373720000071
And setting the domain term K to { K ═ K1,k2,…,klConverting each field entry into vector representation to obtain a vector representation set of the field entries
Figure BDA0002978373720000072
Step S302: the text error correction model calculates the output probability distribution of the characters based on the vector representation of the text to be error corrected, the vector representation of the character pinyin and the vector of the field entry;
the vector representation of the character pinyin and the vector of the field entry are introduced in the process of calculating the output probability distribution of the characters, namely the characteristics of the text to be corrected are increased, so that the accuracy of calculating the output probability can be effectively improved.
Step S303: and determining the characters included in the correct text according to the output probability distribution of the characters.
In an embodiment of the present invention, as shown in fig. 4, the text error correction method may further include the following steps:
step S401: respectively coding vector representation of a text to be corrected, vector representation of character pinyin and vector representation of a field entry by using a coder;
the specific implementation manner of step S401 may include: the vector representation of the field entry is blended into the vector representation of the text to be corrected and the vector representation of the character pinyin; and encoding the merged result.
The encoding process can be implemented with the following calculation formulas (1) and (2), which yield character vector representations fused with the domain entries:

\alpha_{ij} = \mathrm{softmax}\left( \frac{(W_q x_i)^\top (W_k e_{ij})}{\sqrt{d}} \right) \quad (1)

h_i = \sum_j \alpha_{ij} \, (W_v e_{ij}) \quad (2)

where \alpha_{ij} is the attention weight fusing the i-th character of the text to be corrected with the j-th domain entry character; softmax() is the softmax function; W_q, W_k and W_v are parameter matrices trained with the encoder; d is the dimension of the key vectors W_k e_{ij}; x_i is the vector representation of the i-th character of the text to be corrected; e_{ij} is the vector representation of the j-th character in the domain entry retrieved for the i-th character; and h_i is the knowledge-fused character representation.
That is, the result of calculation formula (2) is the output of the encoder (the encoding corresponding to the knowledge-fused character representations).
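A minimal numerical sketch of calculation formulas (1) and (2), with random matrices standing in for the trained parameters W_q, W_k, W_v and for the input vectors:

```python
import numpy as np

# Numerical sketch of the cross-attention fusion of formulas (1)-(2):
# each character vector x_i attends over the domain-entry vectors, and
# the attention-weighted values give the fused representation h_i.
# All matrices are random stand-ins for trained parameters and inputs.
rng = np.random.default_rng(0)
d = 8                                    # vector dimension
n_chars, n_entry_chars = 5, 3
X = rng.normal(size=(n_chars, d))        # character vectors of the text
E = rng.normal(size=(n_entry_chars, d))  # vectors of retrieved entry characters
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

scores = (X @ Wq) @ (E @ Wk).T / np.sqrt(d)  # scaled dot products, formula (1)
alpha = softmax(scores, axis=-1)             # attention weights alpha_ij
H = alpha @ (E @ Wv)                         # fused representations h_i, formula (2)

assert H.shape == (n_chars, d)
assert np.allclose(alpha.sum(axis=-1), 1.0)
```

Each row of H is one character's representation after the domain knowledge has been blended in, which is what the decoder later attends over.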
Step S402: inputting the result of the encoding into a decoder comprised by the text error correction model;
step S403: and a step of calculating the output probability distribution of the character by the decoder according to the encoding result.
Because the field entries are introduced into the coding, the decoder can more accurately calculate the output probability distribution of the characters in the decoding process.
In an embodiment of the present invention, as shown in fig. 5, the text error correction method may further include the following steps:
step S501: determining a confusion set of each character in the text to be corrected, wherein the confusion set comprises a plurality of approximate characters;
the specific implementation manner of step S501 may be: the obfuscated sets of individual characters are stored in a database in advance, as shown in table 1 below. And searching the confusion set of each character in the text to be corrected in a searching mode.
TABLE 1

Character | Confusion set
Lifting of wine | Shengsheng nephew Sheng … Sheng rope-stay Shengsheng province
Article (Chinese character) | Weak mosquito smelling pattern … preventing mosquito from being asked
Heart with heart-shaped | Xinxinxinxin zinc core … dispute Xinxiong
… | …
The specific implementation of step S501 may also be: searching for approximate characters, near-sound characters and near-shape characters for each character in the text to be corrected, and combining the found characters into the confusion set of the corresponding character.
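A toy sketch of this second implementation; the NEAR_SOUND and NEAR_SHAPE tables are hypothetical stand-ins for the database behind Table 1:

```python
# Toy sketch of step S501: a character's confusion set as the union of
# its near-sound and near-shape characters. NEAR_SOUND / NEAR_SHAPE are
# hypothetical stand-ins for a real character-similarity database.
NEAR_SOUND = {"红": {"烘", "洪", "鸿"}}  # characters sharing the sound "hong"
NEAR_SHAPE = {"红": {"虹", "江"}}        # characters of similar written form

def confusion_set(ch: str) -> set[str]:
    # A character stays in its own confusion set, so the model may
    # also leave it unchanged.
    return {ch} | NEAR_SOUND.get(ch, set()) | NEAR_SHAPE.get(ch, set())

print("烘" in confusion_set("红"))  # -> True
```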
Step S502: based on the confusion set, a step of calculating output probabilities of characters in the text to be corrected is performed.
As shown in fig. 6, the step S502 described above performs a process of calculating an output probability of a character in a text to be corrected, and may include the following steps:
step S601: calculating the copying probability of each character included in the text to be corrected based on the text to be corrected and the field entries;
the calculation process of the step is calculated by the following calculation formula (3).
Figure BDA0002978373720000091
Wherein the content of the first and second substances,
Figure BDA0002978373720000092
representing the replication probability of the ith character in the text to be corrected at the t-th decoding moment (the replication probability refers to the probability of replicating the ith character in the text to be corrected from a source end (the text to be corrected and a domain entry);
Figure BDA0002978373720000093
Figure BDA0002978373720000094
wherein s istRepresenting a hidden state of decoding at the moment t; wqAnd WkRepresenting a parameter matrix obtained by training;
Figure BDA0002978373720000095
the representation of the original input text (text to be corrected) and the coded representation of the domain knowledge are represented.
Step S602: calculating the generation probability of the characters included in the word list based on the confusion set corresponding to each character;
the vocabulary refers to a vocabulary formed by characters in a confusion set corresponding to each character.
The specific implementation manner of step S602 may include: constructing a confusion set indication matrix for the text to be corrected according to the confusion set corresponding to each character; by the confusion set indication matrix, the output range in the generation mode can be limited in the confusion set, and the generation probability of each character in the word list is calculated.
The confusion set indication matrix is M \in \mathbb{R}^{g \times |V|}, where |V| is the number of characters in all confusion sets corresponding to the text to be corrected, and g is the length of the text to be corrected. Each element M_{if} of M takes the value 0 or 1, as given by calculation formula (4):

M_{if} = 1 if the f-th vocabulary character belongs to the confusion set of the i-th character, and M_{if} = 0 otherwise \quad (4)

Using the confusion set indication matrix M, the generation probability of each character is calculated by calculation formula (5), which limits the output range of the generation mode to the confusion set:

\tilde{P}^{gen}_t = \mathrm{normalize}\left( M_i \odot P^{gen}_t \right) \quad (5)

where \tilde{P}^{gen}_t is the generation probability, restricted to the confusion set, at the t-th step of error correction decoding; P^{gen}_t is the reference generation probability at step t given by the decoder; and M_i is the indication vector for the i-th character, i.e., the new indication matrix formed by the i-th row of the confusion set indication matrix.
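A toy sketch of calculation formulas (4) and (5): build the 0/1 indication matrix over a hypothetical vocabulary, mask the base generation distribution, and renormalise (the renormalisation is an assumption needed to keep a valid probability distribution):

```python
import numpy as np

# Sketch of formulas (4)-(5): the 0/1 confusion-set indication matrix M
# restricts generation to the confusion set of the current source
# character. Vocabulary and confusion sets are hypothetical toy data.
vocab = ["红", "烘", "洪", "虹", "吗"]
conf_sets = {0: {"红", "烘", "洪", "虹"}}      # confusion set of source char 0

g, V = 1, len(vocab)
M = np.zeros((g, V))                           # formula (4)
for i, cs in conf_sets.items():
    for f, ch in enumerate(vocab):
        M[i, f] = 1.0 if ch in cs else 0.0

p_gen = np.array([0.1, 0.5, 0.1, 0.1, 0.2])    # base generation distribution
masked = M[0] * p_gen                          # zero out characters outside the set
p_gen_masked = masked / masked.sum()           # formula (5), renormalised

assert p_gen_masked[vocab.index("吗")] == 0.0  # out-of-set character is excluded
assert np.isclose(p_gen_masked.sum(), 1.0)
```

The masking is what shrinks the decoder's search space: only characters confusable with the source character can be generated.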
Step S603: and calculating the output probability distribution of each character according to the generation probability of each character included in the word list and the copying probability of each character included in the text to be corrected.
Each of the characters is a character included in the vocabulary and a character included in the text to be corrected.
In this step, the output probability can be calculated by the following calculation formula (6):

P_t(i) = \beta \, P^{copy}_t(i) + (1 - \beta) \, \tilde{P}^{gen}_t(i) \quad (6)

where P_t(i) is the output probability of the i-th character at the t-th step of error correction decoding; \beta is the trained weight of the copy mode, which serves during decoding as a balance factor between generating from the confusion vocabulary and copying from the text to be corrected; \tilde{P}^{gen}_t(i) is the result of calculation formula (5); and P^{copy}_t(i) is the copy probability of the i-th character at the t-th decoding step, calculated by formula (3), i.e., the probability of copying from the source end (the text to be corrected and the domain entries).
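A minimal sketch of calculation formula (6); beta is fixed here for illustration, whereas in the model it is a trained weight, and both distributions are assumed to be expressed over the same candidate list:

```python
import numpy as np

# Sketch of formula (6): the final output distribution mixes the copy
# distribution and the confusion-set-restricted generation distribution
# with balance factor beta. Toy values; both vectors are assumed to be
# aligned over one shared candidate list.
beta = 0.7
p_copy = np.array([0.6, 0.3, 0.1])  # copy probabilities, formula (3)
p_gen = np.array([0.2, 0.7, 0.1])   # restricted generation probabilities, formula (5)

p_out = beta * p_copy + (1.0 - beta) * p_gen  # formula (6)

assert np.isclose(p_out.sum(), 1.0)  # a convex mix of distributions is a distribution
assert p_out.argmax() == 0
```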
In an embodiment of the present invention, the text error correction method may further include: constructing a loss function by using the output probability of each training sample; the model parameters are trained by minimizing the value of the loss function to obtain a text error correction model.
The loss function constructed in the above step is given by the following calculation formula (7):

loss = -\sum_{t=1}^{T} \log P_t(i'_t) \quad (7)

where loss is the loss value; P_t(i'_t) is the output probability of the reference character i'_t at the t-th position of the training text included in the training sample; and T is the total number of characters of that training text.
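A minimal sketch of calculation formula (7) on a two-character toy training sample:

```python
import numpy as np

# Sketch of formula (7): negative log-likelihood of the reference
# characters under the model's output distributions, summed over the
# T positions of a training sample. P and targets are toy data.
P = np.array([[0.7, 0.2, 0.1],    # output distribution at step t = 1
              [0.1, 0.8, 0.1]])   # output distribution at step t = 2
targets = [0, 1]                  # indices of the reference characters

loss = -sum(np.log(P[t, i]) for t, i in enumerate(targets))

assert loss > 0  # probabilities below 1 give a positive loss
```

Training then minimises this value over all samples, as described above, to obtain the text error correction model.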
In summary, the scheme provided by the embodiment of the invention integrates domain knowledge into the encoding process, improving the error detection and correction capability of the text error correction model. In addition, the pinyin features of the characters to be corrected are added during encoding; and during decoding, the confusion set constrains the generation range of the model, which reduces the search space and improves both prediction accuracy and computational efficiency.
As shown in fig. 7, an embodiment of the present invention provides an apparatus 700 for text error correction, where the apparatus 700 for text error correction may include: a text processing module 701, a domain matching module 702, and a text correction module 703, wherein,
the text processing module 701 is used for acquiring a text to be corrected and generating character pinyin for the text to be corrected;
the domain matching module 702 is configured to divide the text to be corrected into a plurality of character segments with preset lengths, and match domain entries for the character segments in a preset domain knowledge base according to the character pinyin corresponding to the character segments;
the text error correction module 703 is configured to input the text to be corrected, the character pinyin and the field entries into the text error correction model, correct the text to be corrected by using the text error correction model, and output the corrected correct text; the text error correction model is obtained by training with training samples, each training sample comprises an error text and the correct text corresponding to the error text, and the training information input into the text error correction model comprises: the error text, the character pinyin of the error text, and the domain entries of the error text.
In the embodiment of the present invention, the domain matching module 702 is configured to divide a text to be corrected into a plurality of character segments with preset lengths; and retrieving the domain entries for the character segments in a preset domain knowledge base according to the character pinyin corresponding to the character segments.
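The segment-then-match behavior of the domain matching module 702 can be sketched as follows: slide a window of the preset length over the text to be corrected and look each segment up in the domain knowledge base by its pinyin, so that a mistyped segment still hits the entry whose pronunciation matches. `pinyin_of`, the mock syllable table, and the pinyin-keyed knowledge base are illustrative stand-ins, not structures defined by the patent.

```python
def match_domain_entries(text, seg_len, pinyin_of, knowledge_base):
    """Slide a seg_len-character window over the text and retrieve domain
    entries whose pinyin key matches the segment's pinyin."""
    matches = []
    for start in range(len(text) - seg_len + 1):
        seg = text[start:start + seg_len]
        key = pinyin_of(seg)
        if key in knowledge_base:
            matches.append((seg, knowledge_base[key]))
    return matches

# Toy example with a fake pinyin table (real systems would use a pinyin library).
mock_pinyin = {"京": "jing", "东": "dong", "栋": "dong"}
pinyin_of = lambda s: " ".join(mock_pinyin.get(c, c) for c in s)
kb = {"jing dong": "京东"}          # domain entry indexed by its pinyin
hits = match_domain_entries("京栋物流", 2, pinyin_of, kb)
```

Here the misspelled segment "京栋" still retrieves the entry "京东" because the two share the same pinyin, which is exactly the retrieval behavior the module relies on.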
In the embodiment of the present invention, the text error correction module 703 is configured to convert the text to be corrected, the character pinyin, and the field entries into corresponding vector representations, respectively; input the vector representation of the text to be corrected, the vector representation of the character pinyin and the vector representation of the field entries into the text error correction model; the text error correction model calculates the output probability distribution of characters based on the vector representation of the text to be corrected, the vector representation of the character pinyin and the vector representation of the field entries; and the characters included in the correct text are determined according to the output probability distribution of the characters.
In the embodiment of the present invention, the text error correction module 703 is further configured to encode the vector representation of the text to be error corrected, the vector representation of the character pinyin, and the vector representation of the field entry by using an encoder, respectively; inputting the result of the encoding into a decoder comprised by the text error correction model; the decoder calculates the output probability of the character according to the result of the encoding.
In the embodiment of the present invention, the text error correction module 703 is configured to blend the vector representation of the domain entry into the vector representation of the text to be error corrected and the vector representation of the character pinyin; and encoding the merged result.
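One way to realize the described blending is to add the domain-entry vector element-wise into each character's text vector and pinyin vector before encoding. The patent does not fix the fusion operator, so the addition-plus-concatenation below is only an assumed sketch; gating or concatenation alone would also fit the description.

```python
import numpy as np

def blend(token_vecs, pinyin_vecs, entry_vec):
    """Assumed fusion: broadcast the domain-entry vector onto every position
    of the text and pinyin representations, then concatenate the two fused
    streams to form the encoder input."""
    fused_text = token_vecs + entry_vec              # (seq_len, dim)
    fused_pinyin = pinyin_vecs + entry_vec           # (seq_len, dim)
    return np.concatenate([fused_text, fused_pinyin], axis=-1)

tok = np.ones((5, 4))       # 5 characters, 4-dim text vectors
pin = np.zeros((5, 4))      # matching pinyin vectors
ent = np.full(4, 0.5)       # vector of one matched domain entry
enc_input = blend(tok, pin, ent)
```

The merged result `enc_input` is what the encoder would then consume, so the domain knowledge influences every encoded position.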
In this embodiment of the present invention, the text error correction module 703 is configured to determine a confusion set of each character in the text to be corrected, where the confusion set includes a plurality of approximate characters; based on the confusion set, the decoder performs the step of calculating the output probability distribution of the characters.
In the embodiment of the present invention, the text error correction module 703 is configured to calculate, based on the text to be error corrected and the field entries retrieved for the text to be error corrected, a copy probability of each character included in the text to be error corrected; calculating the generation probability of each character in the word list based on the confusion set corresponding to each character; and calculating the output probability distribution of each character according to the generation probability of each character in the word list and the copying probability of each character in the text to be corrected.
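The combination of generation and copy probabilities described here resembles a pointer-generator mix: the final output distribution is a gated sum of the generation distribution over the word list and the copy distribution scattered onto the vocabulary ids of the source characters. The gate value and the scatter-add form below are assumptions used for illustration, not the patent's exact formula.

```python
import numpy as np

def output_distribution(p_gen_vocab, p_copy_src, src_ids, gate):
    """Assumed pointer-generator-style mix of generation and copy modes.

    p_gen_vocab : generation probabilities over the vocabulary
    p_copy_src  : copy probabilities over source positions
    src_ids     : vocabulary id of each source character
    gate        : assumed mixing weight in [0, 1]"""
    out = gate * p_gen_vocab.copy()
    for pos, vid in enumerate(src_ids):
        out[vid] += (1.0 - gate) * p_copy_src[pos]   # scatter copy mass
    return out

vocab = 6
p_gen = np.full(vocab, 1.0 / vocab)                  # uniform generation dist
p_copy = np.array([0.7, 0.3])                        # two source characters
dist = output_distribution(p_gen, p_copy, src_ids=[2, 4], gate=0.5)
```

Because both inputs are probability distributions and the gate is convex, the mixed result is again a valid distribution, from which the output character can be chosen.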
In this embodiment of the present invention, the text error correction module 703 is further configured to construct a confusion set indication matrix for the text to be corrected according to the confusion set corresponding to each character; the output range of the generation mode is limited to the confusion set through the confusion set indication matrix, and the step of calculating the generation probability of each character in the word list is performed.
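The confusion set indication matrix can be read as a 0/1 mask of shape (sequence length, vocabulary size): for each source character it marks which vocabulary characters lie in its confusion set, and generation logits outside the set are masked out before the softmax, shrinking the search space. This masking formulation is one plausible reading of the limitation described above, not the patent's stated implementation.

```python
import numpy as np

def masked_generation_probs(logits, confusion_matrix):
    """Limit the generation range to each character's confusion set.

    confusion_matrix[i, v] == 1 iff vocabulary character v is in the
    confusion set of source character i; other logits become -inf so
    their probability after the softmax is exactly zero."""
    masked = np.where(confusion_matrix > 0, logits, -np.inf)
    masked -= masked.max(axis=-1, keepdims=True)     # numerical stability
    exp = np.exp(masked)
    return exp / exp.sum(axis=-1, keepdims=True)

logits = np.zeros((2, 5))                      # 2 source chars, vocab of 5
conf = np.array([[1, 1, 0, 0, 0],              # char 0: confusion set {0, 1}
                 [0, 0, 1, 1, 1]])             # char 1: confusion set {2, 3, 4}
probs = masked_generation_probs(logits, conf)
```

All probability mass stays inside each character's confusion set, so the generation mode can only propose the approximate characters the indication matrix allows.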
In this embodiment of the present invention, the text error correction module 703 is further configured to construct a loss function by using the output probability of each training sample; the model parameters are trained by minimizing the value of the loss function to obtain a text error correction model.
The text error correction device can be installed on the client side in the form of a plug-in, and can also be deployed on a server that communicates with the client.
Fig. 8 shows an exemplary system architecture 800 of a text error correction method or apparatus to which embodiments of the invention may be applied.
As shown in fig. 8, the system architecture 800 may include terminal devices 801, 802, 803, a network 804, and a server 805. The network 804 serves as a medium for providing communication links between the terminal devices 801, 802, 803 and the server 805. The network 804 may include various connection types, such as wired links, wireless communication links, or fiber optic cables.
A user may use the terminal devices 801, 802, 803 to interact with a server 805 over a network 804 to receive or send messages or the like. The terminal devices 801, 802, 803 may have installed thereon various communication client applications, such as shopping-like applications, web browser applications, search-like applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).
The terminal devices 801, 802, 803 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 805 may be a server providing various services, for example, the server 805 encapsulates the trained text error correction model into a text error correction device or a plug-in, and may publish the text error correction device or the plug-in to various communication client applications installed in the terminal devices 801, 802, 803 via the network 804. The server 805 may further encapsulate the trained text correction model into a text correction device and operate the text correction device.
For the case that the server 805 encapsulates the trained text error correction model into a text error correction device or a plug-in and issues it through the network 804 to the communication client applications installed on the terminal devices 801, 802, and 803: when a communication client application on the terminal devices 801, 802, and 803 receives externally input text, the text is taken as the text to be corrected and is subjected to error correction processing by the text error correction device or plug-in.
For the case that the server 805 encapsulates the trained text error correction model into a text error correction device and runs the device itself: text input by a user through a communication client application on the terminal devices 801, 802, and 803 is obtained and taken as the text to be corrected; the text error correction device performs error correction processing on it and outputs the corrected correct text to the communication client application on the terminal devices 801, 802, and 803.
It should be noted that the method for text error correction provided by the embodiment of the present invention may be executed by the terminal devices 801, 802, and 803 or the server 805, and accordingly, the apparatus for text error correction may be disposed in the terminal devices 801, 802, and 803 or the server 805.
It should be understood that the number of terminal devices, networks, and servers in fig. 8 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 9, shown is a block diagram of a computer system 900 suitable for use with a terminal device implementing an embodiment of the present invention. The terminal device shown in fig. 9 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 9, the computer system 900 includes a Central Processing Unit (CPU)901 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)902 or a program loaded from a storage section 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data necessary for the operation of the system 900 are also stored. The CPU 901, ROM 902, and RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to bus 904.
The following components are connected to the I/O interface 905: an input portion 906 including a keyboard, a mouse, and the like; an output section 907 including components such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the internet. The drive 910 is also connected to the I/O interface 905 as necessary. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 910 as necessary, so that a computer program read out therefrom is mounted into the storage section 908 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 909, and/or installed from the removable medium 911. The above-described functions defined in the system of the present invention are executed when the computer program is executed by a Central Processing Unit (CPU) 901.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes a text processing module, a domain matching module, and a text error correction module. In some cases, the names of these modules do not constitute a limitation on the modules themselves; for example, the text processing module may also be described as a "module for acquiring a text to be corrected and generating character pinyin for the text to be corrected".
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be separate and not incorporated into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to comprise: acquiring a text to be corrected, and generating character pinyin for the text to be corrected; retrieving a domain entry for the text to be corrected in a preset domain knowledge base; inputting a text to be corrected, character pinyin and field entries into a text correction model, wherein the text correction model is obtained by training a training sample, the training sample comprises an error text and a correct text corresponding to the error text, and training information input for the text correction model comprises: the method comprises the following steps of (1) obtaining a wrong text, a character pinyin of the wrong text and a field entry of the wrong text; and correcting the text to be corrected by using the text correction model, and outputting the corrected correct text.
According to the technical scheme provided by the embodiment of the invention, character pinyin is generated for the text to be corrected, the text to be corrected is divided into a plurality of character segments of preset length, and domain entries are matched for the character segments in the preset domain knowledge base according to the character pinyin corresponding to the character segments; that is, the character pinyin and the domain entries are introduced as additional information for correcting the text to be corrected.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (12)

1. A method of text correction, comprising:
acquiring a text to be corrected, and generating character pinyin for the text to be corrected;
retrieving a domain entry for the text to be corrected in a preset domain knowledge base;
inputting the text to be corrected, the character pinyin and the field entries into a text correction model, wherein the text correction model is obtained by training a training sample, the training sample comprises an error text and a correct text corresponding to the error text, and the training information input for the text correction model comprises: the error text, the character pinyin of the error text and the domain entries of the error text;
and correcting the text to be corrected by using the text correction model, and outputting the corrected correct text.
2. The method of claim 1, wherein retrieving a domain entry for the text to be corrected comprises:
dividing the text to be corrected into a plurality of character segments with preset lengths;
and retrieving a domain entry for the character segment in a preset domain knowledge base according to the character pinyin corresponding to the character segment.
3. The method of claim 1, further comprising:
respectively converting the text to be corrected, the character pinyin and the field entries into corresponding vector representations;
correcting the text to be corrected, including:
inputting the vector representation of the text to be corrected, the vector representation of the character pinyin and the vector representation of the field entry into the text correction model;
the text error correction model calculates the output probability distribution of characters based on the vector representation of the text to be corrected, the vector representation of the character pinyin and the vector representation of the field entry;
and determining the characters included in the correct text according to the output probability distribution of the characters.
4. The method of claim 3, wherein computing the output probability distribution for the character comprises:
respectively encoding the vector representation of the text to be corrected, the vector representation of the character pinyin and the vector representation of the field entry by using the encoder;
inputting the result of the encoding into a decoder comprised by the text correction model;
and the decoder calculates the output probability distribution of the characters according to the encoding result.
5. The method of claim 4, wherein encoding the vector representation of the text to be corrected, the vector representation of the character pinyin and the vector representation of the domain entry by the encoder respectively comprises:
the vector representation of the field entry is blended into the vector representation of the text to be corrected and the vector representation of the character pinyin;
and encoding the merged result.
6. The method of claim 4, further comprising:
determining a confusion set of each character in the text to be corrected, wherein the confusion set comprises a plurality of approximate characters;
based on the confusion set, the decoder performs the step of calculating an output probability distribution for the character.
7. The method of claim 6, wherein computing the output probability distribution for the character comprises:
calculating the copy probability of each character included in the text to be corrected based on the text to be corrected and the field entry retrieved for the text to be corrected;
calculating the generation probability of the characters included in the word list based on the confusion set corresponding to each character;
and calculating the output probability of each character according to the generation probability of the character included in the word list and the copying probability of each character included in the text to be corrected.
8. The method of claim 7, further comprising:
constructing a confusion set indication matrix for the text to be corrected according to the confusion set corresponding to each character;
and limiting the output range in the generation mode to the confusion set through the confusion set indication matrix, and executing the step of calculating the generation probability of each character included in the word list.
9. The method of claim 6, further comprising:
constructing a loss function by using the output probability of each training sample;
training model parameters by minimizing the value of the loss function to obtain the text error correction model.
10. An apparatus for correcting text, comprising: a text processing module, a domain matching module, and a text correction module, wherein,
the text processing module is used for acquiring a text to be corrected and generating character pinyin for the text to be corrected;
the domain matching module is used for dividing the text to be corrected into a plurality of character segments with preset lengths and matching domain entries for the character segments in a preset domain knowledge base according to the character pinyin corresponding to the character segments;
the text error correction module is used for inputting the text to be corrected, the character pinyin and the field entries into a text error correction model, correcting the text to be corrected by using the text error correction model and outputting the corrected correct text; the text error correction model is obtained by training with training samples, the training sample comprises an error text and a correct text corresponding to the error text, and the training information input for the text error correction model comprises: the wrong text, the character pinyin of the wrong text and the domain entries of the wrong text.
11. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-9.
12. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-9.
CN202110279919.9A 2021-03-16 2021-03-16 Text error correction method and device Pending CN113051894A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110279919.9A CN113051894A (en) 2021-03-16 2021-03-16 Text error correction method and device


Publications (1)

Publication Number Publication Date
CN113051894A true CN113051894A (en) 2021-06-29

Family

ID=76512806

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110279919.9A Pending CN113051894A (en) 2021-03-16 2021-03-16 Text error correction method and device

Country Status (1)

Country Link
CN (1) CN113051894A (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014036827A1 (en) * 2012-09-10 2014-03-13 华为技术有限公司 Text correcting method and user equipment
US20180060302A1 (en) * 2016-08-24 2018-03-01 Microsoft Technology Licensing, Llc Characteristic-pattern analysis of text
CN111428494A (en) * 2020-03-11 2020-07-17 中国平安人寿保险股份有限公司 Intelligent error correction method, device and equipment for proper nouns and storage medium
CN111523306A (en) * 2019-01-17 2020-08-11 阿里巴巴集团控股有限公司 Text error correction method, device and system
CN111626048A (en) * 2020-05-22 2020-09-04 腾讯科技(深圳)有限公司 Text error correction method, device, equipment and storage medium
CN112287670A (en) * 2020-11-18 2021-01-29 北京明略软件系统有限公司 Text error correction method, system, computer device and readable storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG Quanbin, TAN Ying: "Chinese Grammatical Error Correction Method Based on Data Augmentation and Copying", CAAI Transactions on Intelligent Systems, vol. 15, no. 1, pages 99-105 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114239559A (en) * 2021-11-15 2022-03-25 北京百度网讯科技有限公司 Method, apparatus, device and medium for generating text error correction and text error correction model
CN114239559B (en) * 2021-11-15 2023-07-11 北京百度网讯科技有限公司 Text error correction and text error correction model generation method, device, equipment and medium
CN116757184A (en) * 2023-08-18 2023-09-15 昆明理工大学 Vietnam voice recognition text error correction method and system integrating pronunciation characteristics
CN116757184B (en) * 2023-08-18 2023-10-20 昆明理工大学 Vietnam voice recognition text error correction method and system integrating pronunciation characteristics


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 221, 2 / F, block C, 18 Kechuang 11th Street, Beijing Economic and Technological Development Zone, 100176

Applicant after: Jingdong Technology Holding Co.,Ltd.

Address before: Room 221, 2 / F, block C, 18 Kechuang 11th Street, Beijing Economic and Technological Development Zone, 100176

Applicant before: Jingdong Digital Technology Holding Co., Ltd