CN111460795B

CN111460795B - Text error correction method and system

Info

Publication number: CN111460795B
Application number: CN202010225790.9A
Authority: CN
Inventors: 王博
Original assignee: Unisound Intelligent Technology Co Ltd
Current assignee: Unisound Intelligent Technology Co Ltd
Priority date: 2020-03-26
Filing date: 2020-03-26
Publication date: 2023-05-26
Anticipated expiration: 2040-03-26
Also published as: CN111460795A

Abstract

The invention discloses a text error correction method and system, wherein the method comprises the following steps: acquiring a text to be corrected; determining potential error entries according to the text to be corrected; obtaining a confusion set, wherein the confusion set is provided with a plurality of confusion word pairs and transition probabilities of the confusion word pairs; and generating an error correction result according to the potential error entry and the confusion set. By the technical scheme of the invention, the error correction result is more accurate.

Description

Text error correction method and system

Technical Field

The present invention relates to the field of information processing technologies, and in particular, to a text error correction method and system.

Background

When text correction is performed with a language model, the transition probability of every confusion word pair is typically set to a fixed value, so low-frequency words are easily miscorrected into high-frequency words. The problem is more pronounced in specialized fields such as medicine. For example, "administering antibiotic therapy to a patient" may be miscorrected to "and patient antibiotic therapy", with the low-frequency word "administering" replaced by the high-frequency word "and". This makes the error correction result inaccurate.

Disclosure of Invention

The invention provides a text error correction method and a text error correction system, wherein the technical scheme is as follows:

according to a first aspect of an embodiment of the present invention, there is provided a text error correction method, including:

acquiring a text to be corrected;

determining potential error entries according to the text to be corrected;

obtaining a confusion set, wherein the confusion set is provided with a plurality of confusion word pairs and transition probabilities of the confusion word pairs;

and generating an error correction result according to the potential error entry and the confusion set.

In one embodiment, the determining the potentially wrong entry according to the text to be corrected includes:

word segmentation is carried out on the text to be corrected to obtain a text to be corrected after word segmentation, wherein the text to be corrected after word segmentation has a first preset number of entries;

calculating the text to be corrected after word segmentation to obtain forward probabilities and reverse probabilities respectively corresponding to the first preset number of entries;

averaging the forward probabilities and the reverse probabilities which correspond to the first preset number of entries respectively to obtain average probabilities which correspond to the first preset number of entries respectively;

marking the entries with average probability smaller than a preset threshold value in the first preset number of entries to obtain a second preset number of entries, wherein the value of the second preset number is smaller than or equal to the first preset number;

and determining the second preset number of entries as the potential error entries, wherein the number of the potential error entries is the second preset number.

In one embodiment, the obtaining the confusion set includes:

acquiring corpus;

obtaining different confusion texts according to the corpus, wherein the different confusion texts are provided with the plurality of confusion word pairs;

counting the word frequency of each of the plurality of confusion word pairs;

performing a calculation on the respective word frequencies of the plurality of confusion word pairs to obtain transition probabilities of the plurality of confusion word pairs;

and determining the confusion set according to the different confusion texts and the transition probabilities of the plurality of confusion word pairs.

In one embodiment, the generating an error correction result from the potentially erroneous entry and the confusion set includes:

inquiring the confusion set based on the potential wrong entry to determine a candidate confusion word corresponding to the potential wrong entry and the transition probability of the potential wrong entry and the candidate confusion word, wherein the number of the candidate confusion words is a second preset number;

deleting the candidate confusion words which do not accord with the preset standard in the confusion set to obtain a deleted confusion set;

carrying out probability calculation on the deleted confusion set, and taking the value with the maximum probability in the confusion set as a candidate text;

and determining the error correction result according to the candidate text.

In one embodiment, the preset criteria include:

counting n-tuples at the corresponding positions of the potential error entries;

and after the candidate confusion word is used for replacing the potential wrong entry, the probability of the n-tuple is not reduced.

According to a second aspect of an embodiment of the present invention, there is provided a text error correction system, including:

the first acquisition module is used for acquiring the text to be corrected;

the determining module is used for determining potential error entries according to the text to be corrected;

the second acquisition module is used for acquiring a confusion set, wherein the confusion set is provided with a plurality of confusion word pairs and transition probabilities of the confusion word pairs;

and the generating module is used for generating an error correction result according to the potential error entry and the confusion set.

In one embodiment, the determining module includes:

the word segmentation sub-module is used for segmenting the text to be corrected to obtain segmented text to be corrected, wherein the segmented text to be corrected is provided with a first preset number of entries;

the first calculation sub-module is used for calculating the text to be corrected after word segmentation to obtain forward probabilities and reverse probabilities respectively corresponding to the first preset number of entries;

the averaging sub-module is used for carrying out averaging processing on the forward probabilities and the reverse probabilities which correspond to the first preset number of entries respectively so as to obtain average probabilities which correspond to the first preset number of entries respectively;

the marking sub-module is used for marking the entries with average probability smaller than a preset threshold value in the first preset number of entries to obtain a second preset number of entries, wherein the value of the second preset number is smaller than or equal to the first preset number;

the first determining submodule is used for determining that the second preset number of entries are the potential error entries, wherein the number of the potential error entries is the second preset number.

In one embodiment, the second acquisition module includes:

the first acquisition sub-module is used for acquiring corpus;

the second obtaining submodule is used for obtaining different confusion texts according to the corpus, wherein the different confusion texts are provided with the plurality of confusion word pairs;

the statistics sub-module is used for counting the word frequencies of the plurality of confusion word pairs;

the second calculation sub-module is used for calculating the respective word frequency according to the plurality of confusion word pairs so as to obtain the transition probability of the plurality of confusion word pairs;

and the second determining submodule is used for determining the confusion set according to the different confusion texts and the transition probabilities of the confusion word pairs.

In one embodiment, the generating module includes:

the query sub-module is used for querying the confusion set based on the potential wrong vocabulary entry to determine candidate confusion words corresponding to the potential wrong vocabulary entry and the transition probability of the potential wrong vocabulary entry and the candidate confusion words, wherein the number of the candidate confusion words is a second preset number;

a deleting sub-module, configured to delete candidate confusion words in the confusion set that do not meet a preset standard, so as to obtain a deleted confusion set;

a third calculation sub-module, configured to perform probability calculation on the deleted confusion set, and use a value with the maximum probability in the confusion set as a candidate text;

and determining the error correction result according to the candidate text.

In one embodiment, the preset criteria include

The technical scheme provided by the embodiment of the invention can comprise the following beneficial effects:

acquiring a text to be corrected; then, determining potential error entries according to the text to be corrected; then, obtaining a confusion set; and further, generating an error correction result according to the potential error entry and the confusion set. Through this technical scheme, different confusion word pairs have different transition probabilities, so that low-frequency words are not miscorrected into high-frequency words during error correction, which improves the accuracy of the error correction result.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description and drawings.

The technical scheme of the invention is further described in detail through the drawings and the embodiments.

Drawings

The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:

FIG. 1 is a flow chart of a text error correction method according to an embodiment of the invention;

FIG. 2 is a flow chart of another text error correction method according to an embodiment of the invention;

FIG. 3 is a block diagram of a text error correction system in accordance with one embodiment of the present invention;

FIG. 4 is a block diagram of another text error correction system in accordance with an embodiment of the present invention.

Detailed Description

The preferred embodiments of the present invention will be described below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are for illustration and explanation of the present invention only, and are not intended to limit the present invention.

FIG. 1 is a flowchart of a text error correction method according to an embodiment of the present invention. As shown in FIG. 1, the method may be implemented as steps S11-S14:

in step S11, obtaining a text to be corrected;

in step S12, determining potentially erroneous entries according to the text to be corrected;

By way of example, if the text to be corrected is "I eat tomato stir-fried potatoes today", the potentially erroneous entry therein may be "potatoes".

In step S13, a confusion set is obtained, wherein the confusion set has a plurality of confusion word pairs and transition probabilities of the confusion word pairs;

By way of example, a confusion word pair may be "potato" and "egg", or "potato" and "banana"; in the confusion set, "egg" may be a low-frequency word and "banana" a high-frequency word. The transition probability refers to the probability that an entry A is input as (or corrected into) an entry B, and may be explained by the following formula. For example, the text Y consists of $y_1 y_2 y_3 \dots y_n$ and the text X consists of $x_1 x_2 x_3 \dots x_n$; then:

wherein $p(x_i \mid y_i)$ is the transition probability from the entry $y_i$ to $x_i$, which is equivalent to the probability that $y_i$ is input as $x_i$.
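The formula referred to above is not reproduced in this text. One form consistent with the surrounding definitions, assuming that entries are transformed independently position by position (an assumption for illustration, not a statement from the source), would be:

```latex
P(X \mid Y) = \prod_{i=1}^{n} p(x_i \mid y_i)
```

Under this reading, the probability that the whole text Y is input as X is the product of the per-entry transition probabilities.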

In step S14, an error correction result is generated according to the potentially erroneous entry and the confusion set. Error correction is carried out according to the confusion word pairs in the confusion set and their transition probabilities; for example, the error correction result generated for "I eat tomato stir-fried potatoes today" may be "I eat tomato stir-fried eggs today".

As shown in FIG. 2, in one embodiment, the above step S12 may be implemented as the following steps S121-S125:

in step S121, word segmentation is performed on the text to be corrected to obtain a text to be corrected after word segmentation, wherein the text to be corrected after word segmentation has a first preset number of entries;

in step S122, calculating the text to be corrected after word segmentation to obtain forward probabilities and reverse probabilities respectively corresponding to the first preset number of entries. Specifically, the text to be corrected after word segmentation is scored by a language model to obtain the forward probability and the reverse probability, wherein the forward probability reflects the correlation between the current entry and the entry preceding it, and the reverse probability reflects the correlation between the current entry and the entry following it.

In step S123, the forward probabilities and the reverse probabilities corresponding to the first preset number of entries are averaged to obtain average probabilities corresponding to the first preset number of entries;

in step S124, the entries with average probability smaller than the preset threshold value in the first preset number of entries are marked to obtain a second preset number of entries, where the value of the second preset number is smaller than or equal to the first preset number;

in step S125, a second predetermined number of entries is determined as potentially erroneous entries, wherein the number of potentially erroneous entries is the second predetermined number.

Firstly, word segmentation is carried out on the text to be corrected to obtain the text to be corrected after word segmentation. Secondly, the text to be corrected after word segmentation is calculated to obtain the forward probabilities and reverse probabilities respectively corresponding to the first preset number of entries. Then, the forward probabilities and reverse probabilities are averaged to obtain the average probabilities respectively corresponding to the first preset number of entries. Then, the entries whose average probability is smaller than the preset threshold are marked to obtain the second preset number of entries. Finally, the second preset number of entries are determined as the potentially erroneous entries. Through this technical scheme, misjudgment of erroneous entries can be avoided and the potentially erroneous entries can be acquired accurately.
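Steps S121-S125 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `find_suspect_entries`, the toy probability tables, and the threshold value are all hypothetical, and the two scoring callables stand in for whatever language model supplies the forward and reverse probabilities.

```python
from typing import Callable, List, Sequence

# Assumed scorer signature: (entries, index) -> probability in [0, 1].
Scorer = Callable[[Sequence[str], int], float]


def find_suspect_entries(
    entries: Sequence[str],
    forward_prob: Scorer,
    reverse_prob: Scorer,
    threshold: float,
) -> List[int]:
    """Return indices of entries whose averaged probability is below threshold."""
    suspects = []
    for i in range(len(entries)):
        # Average the forward and reverse probabilities of entry i (step S123).
        average = (forward_prob(entries, i) + reverse_prob(entries, i)) / 2.0
        # Mark entries below the preset threshold (step S124).
        if average < threshold:
            suspects.append(i)
    return suspects


# Toy stand-ins for a language model; the numbers are made up for illustration.
FORWARD_TABLE = {("stir-fried", "potatoes"): 0.02}
REVERSE_TABLE = {("potatoes", "today"): 0.10}
DEFAULT_PROB = 0.5


def toy_forward(entries: Sequence[str], i: int) -> float:
    # Relates the current entry to the entry in front of it.
    previous = entries[i - 1] if i > 0 else "<s>"
    return FORWARD_TABLE.get((previous, entries[i]), DEFAULT_PROB)


def toy_reverse(entries: Sequence[str], i: int) -> float:
    # Relates the current entry to the entry behind it.
    following = entries[i + 1] if i + 1 < len(entries) else "</s>"
    return REVERSE_TABLE.get((entries[i], following), DEFAULT_PROB)


entries = ["I", "eat", "tomato", "stir-fried", "potatoes", "today"]
suspects = find_suspect_entries(entries, toy_forward, toy_reverse, threshold=0.2)
```

With these toy tables only "potatoes" falls below the threshold, so the number of marked entries (the second preset number) is 1, which is no greater than the number of segmented entries.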

In one embodiment, the obtaining the confusion set includes:

acquiring corpus;

counting the word frequency of each of the plurality of confusion word pairs;

The transition probabilities between different confusion word pairs are calculated in combination with the input habits of users. When people input text, the cost of inputting a low-frequency word is often greater than that of inputting a high-frequency word, so most text input errors are low-frequency words wrongly input as high-frequency words, while high-frequency words are rarely wrongly input as low-frequency words. Therefore, according to word frequency, the transition probability from high-frequency words to low-frequency words is reduced, and the transition probability from low-frequency words to high-frequency words is increased. Specifically, the transition probability from the entry y to the confusion word x is calculated using the following formula:

wherein $P_C$ is the reference transition probability of the confusion word pair, $\alpha$ is an adjustment factor that controls the influence of word frequency on the final transition probability, freq(x) is the word frequency of entry x, and freq(y) is the word frequency of entry y.

In this way, the transition probabilities of the plurality of confusion word pairs are obtained by a calculation on their respective word frequencies, so that the error correction result is more accurate when text error correction is performed.
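The formula itself is not reproduced in this text, so the sketch below only illustrates the described behavior. It assumes a power-law frequency ratio, `base_prob * (freq(x) / freq(y)) ** alpha`, capped at 1. That functional form, the add-one smoothing, and the names `transition_probability` and `build_confusion_set` are hypothetical choices made for illustration and should not be read as the patent's formula.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple


def transition_probability(
    freq_x: float, freq_y: float, base_prob: float, alpha: float
) -> float:
    """Hypothetical frequency-adjusted probability of entry y becoming entry x.

    base_prob plays the role of the reference transition probability P_C and
    alpha the role of the adjustment factor. The power-law form is an assumption.
    """
    if freq_x <= 0 or freq_y <= 0:
        raise ValueError("word frequencies must be positive")
    return min(1.0, base_prob * (freq_x / freq_y) ** alpha)


def build_confusion_set(
    corpus_tokens: Iterable[str],
    pairs: Sequence[Tuple[str, str]],
    base_prob: float = 0.1,
    alpha: float = 0.5,
) -> Dict[Tuple[str, str], float]:
    """Map each (y, x) confusion word pair to its transition probability."""
    freq = Counter(corpus_tokens)
    table = {}
    for y, x in pairs:
        # Add-one smoothing (an assumption) keeps unseen words from dividing by zero.
        table[(y, x)] = transition_probability(
            freq[x] + 1, freq[y] + 1, base_prob, alpha
        )
    return table


# Toy corpus: "banana" is the high-frequency word, "egg" the low-frequency one.
tokens = ["banana"] * 8 + ["egg"] * 2
confusion_set = build_confusion_set(tokens, [("egg", "banana"), ("banana", "egg")])
```

Under this assumed form, the low-to-high transition ("egg" to "banana") ends up above the reference value and the high-to-low transition below it, matching the direction described above. Setting `alpha` to 0 recovers the fixed reference probability.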

carrying out probability calculation on the deleted confusion set, and taking the value with the maximum probability in the confusion set as a candidate text. Here, the role of the language model is to calculate the probability of a sentence in order to determine whether the sentence is grammatically fluent. For example, probability calculation is performed on the deleted confusion set through the language model, where the deleted confusion set comprises "I eat tomato stir-fried eggplant today" and "I eat tomato stir-fried egg today"; the calculated probability of the latter is larger than that of the former, so the candidate text is "I eat tomato stir-fried egg today".

And determining the error correction result according to the candidate text.

Through the technical scheme, the error correction result can be accurately obtained.
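Selecting the candidate text can be sketched as an argmax over the surviving candidates. The source states only that a language model scores the candidates and that the maximum is taken; multiplying the sentence probability by the transition probability, as done here, is an assumption, and `pick_candidate_text` and `toy_sentence_prob` are hypothetical names with made-up scores.

```python
from typing import Callable, Dict, List, Sequence


def pick_candidate_text(
    entries: Sequence[str],
    position: int,
    candidates: Dict[str, float],
    sentence_prob: Callable[[Sequence[str]], float],
) -> List[str]:
    """Return the substituted sentence with the highest combined score.

    candidates maps each candidate confusion word to its transition probability.
    If there are no candidates, the original entries are returned unchanged.
    """
    best = list(entries)
    best_score = float("-inf")
    for word, transition in candidates.items():
        sentence = list(entries)
        sentence[position] = word
        # Assumed scoring: sentence probability weighted by transition probability.
        score = sentence_prob(sentence) * transition
        if score > best_score:
            best, best_score = sentence, score
    return best


def toy_sentence_prob(sentence: Sequence[str]) -> float:
    # Stand-in for a language model; the scores are invented for illustration.
    if "egg" in sentence:
        return 0.3
    if "eggplant" in sentence:
        return 0.05
    return 0.01


entries = ["I", "eat", "tomato", "stir-fried", "potatoes", "today"]
candidate_text = pick_candidate_text(
    entries, 4, {"eggplant": 0.1, "egg": 0.1}, toy_sentence_prob
)
```

With equal transition probabilities the choice is decided by the sentence probability alone, which mirrors the "eggplant" versus "egg" example above.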

In one embodiment, the preset criteria include

Illustratively, the text is $x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8$, the potentially erroneous entry is $x_4$, and the candidate confusion entry is $y_4$; all n-tuples at the position corresponding to $x_4$ are counted (where $4 \ge n \ge 3$).

Illustratively, the n-tuple probabilities before and after substitution of the confusion entry are calculated through an N-gram model; after $x_4$ is replaced by $y_4$, the probability of the n-tuples in the context of the entry's position is not reduced.

Through this technical scheme, after the potentially erroneous entry is replaced, the contextual coherence of the text at the replaced position is not degraded, and miscorrection is thereby reduced.
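The preset criterion can be sketched as a filter over the n-tuples that cover the suspect position. The sketch assumes that "not reduced" is checked for every covering n-tuple individually, which is one possible reading of the text; `covering_windows`, `passes_ngram_check`, and the lambda scorers are hypothetical stand-ins for an N-gram model.

```python
from typing import Callable, List, Sequence, Tuple


def covering_windows(
    length: int, pos: int, n_values: Sequence[int] = (3, 4)
) -> List[Tuple[int, int]]:
    """List (start, end) slices of every n-tuple that contains index pos."""
    windows = []
    for n in n_values:
        first = max(0, pos - n + 1)
        last = min(pos, length - n)
        for start in range(first, last + 1):
            windows.append((start, start + n))
    return windows


def passes_ngram_check(
    entries: Sequence[str],
    pos: int,
    candidate: str,
    ngram_prob: Callable[[Tuple[str, ...]], float],
    n_values: Sequence[int] = (3, 4),
) -> bool:
    """Keep the candidate only if no covering n-tuple becomes less probable."""
    replaced = list(entries)
    replaced[pos] = candidate
    for start, end in covering_windows(len(entries), pos, n_values):
        before = ngram_prob(tuple(entries[start:end]))
        after = ngram_prob(tuple(replaced[start:end]))
        if after < before:
            return False
    return True


text = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]
windows = covering_windows(len(text), 3)  # x4 sits at zero-based index 3
keeps = passes_ngram_check(text, 3, "y4", lambda g: 0.3 if "y4" in g else 0.1)
drops = passes_ngram_check(text, 3, "y4", lambda g: 0.05 if "y4" in g else 0.1)
```

For the eight-entry example, three 3-tuples and four 4-tuples cover the position of $x_4$, so seven n-tuples are compared before and after the substitution.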

Corresponding to the text error correction method provided by the embodiment of the present invention, the embodiment of the present invention further provides a text error correction system. As shown in FIG. 3, the system includes:

a first obtaining module 31, configured to obtain a text to be corrected;

a determining module 32, configured to determine a potentially erroneous entry according to the text to be corrected;

a second obtaining module 33, configured to obtain a confusion set, where the confusion set has a plurality of confusion word pairs and transition probabilities of the plurality of confusion word pairs;

and the generating module 34 is configured to generate an error correction result according to the potentially erroneous entry and the confusion set.
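The four modules can be sketched as pluggable callables wired together in the order of steps S11-S14. The class name `TextCorrectionSystem`, the field names, and the toy functions are hypothetical. Passing the full text to the generating module alongside the suspect entries is also an assumption made so that the example is runnable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Assumed representation: (entry y, confusion word x) -> transition probability.
ConfusionSet = Dict[Tuple[str, str], float]


@dataclass
class TextCorrectionSystem:
    acquire_text: Callable[[], str]                     # first obtaining module 31
    determine_suspects: Callable[[str], List[str]]      # determining module 32
    acquire_confusion_set: Callable[[], ConfusionSet]   # second obtaining module 33
    generate_result: Callable[[str, List[str], ConfusionSet], str]  # generating module 34

    def run(self) -> str:
        text = self.acquire_text()
        suspects = self.determine_suspects(text)
        confusion_set = self.acquire_confusion_set()
        return self.generate_result(text, suspects, confusion_set)


def toy_generate(text: str, suspects: List[str], confusion_set: ConfusionSet) -> str:
    # Replace each suspect with its most probable confusion word, if any.
    for word in suspects:
        options = {x: p for (y, x), p in confusion_set.items() if y == word}
        if options:
            text = text.replace(word, max(options, key=options.get))
    return text


system = TextCorrectionSystem(
    acquire_text=lambda: "I eat tomato stir-fried potatoes today",
    determine_suspects=lambda text: ["potatoes"],
    acquire_confusion_set=lambda: {("potatoes", "eggs"): 0.2, ("potatoes", "bananas"): 0.1},
    generate_result=toy_generate,
)
corrected = system.run()
```

Keeping each module behind a plain callable lets the detection, confusion-set, and generation steps be swapped independently, in the same way the sub-modules refine modules 32-34.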

As shown in FIG. 4, in one embodiment, the determining module 32 includes:

the word segmentation sub-module 321 is configured to segment the text to be corrected to obtain a text to be corrected after word segmentation, where the text to be corrected after word segmentation has a first preset number of entries;

a first calculation sub-module 322, configured to calculate the text to be corrected after word segmentation, so as to obtain a forward probability and a reverse probability that correspond to the first preset number of terms respectively;

an averaging submodule 323, configured to average the forward probabilities and the reverse probabilities corresponding to the first preset number of entries respectively, so as to obtain average probabilities corresponding to the first preset number of entries respectively;

a marking sub-module 324, configured to mark an entry with an average probability smaller than a preset threshold value in the first preset number of entries, so as to obtain a second preset number of entries, where a value of the second preset number is smaller than or equal to the first preset number;

a first determining sub-module 325, configured to determine the second preset number of terms as the potentially wrong terms, where the number of potentially wrong terms is the second preset number.

In one embodiment, the second acquisition module includes:

the first acquisition sub-module is used for acquiring corpus;

In one embodiment, the generating module includes:

and determining the error correction result according to the candidate text.

In one embodiment, the preset criteria include

It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. A method for text correction, comprising:

acquiring a text to be corrected;

determining potential error entries according to the text to be corrected;

generating an error correction result according to the potential error entry and the confusion set;

the determining the potential error entry according to the text to be corrected comprises the following steps:

determining the second preset number of entries as the potential error entries, wherein the number of the potential error entries is the second preset number;

the obtaining the confusion set comprises:

acquiring corpus;

counting the word frequency of each of the plurality of confusion word pairs;

determining the confusion set according to the different confusion texts and the transition probabilities of the plurality of confusion word pairs;

the generating an error correction result according to the potentially erroneous entry and the confusion set includes:

determining the error correction result according to the candidate text;

the preset standard comprises

2. A text error correction system, comprising:

the first acquisition module is used for acquiring the text to be corrected;

the generating module is used for generating an error correction result according to the potential error entry and the confusion set;

the determining module includes:

the first determining submodule is used for determining that the second preset number of entries are the potential error entries, wherein the number of the potential error entries is the second preset number;

the second acquisition module includes:

the first acquisition sub-module is used for acquiring corpus;

a second determining sub-module, configured to determine the confusion set according to the different confusion texts and transition probabilities of the plurality of confusion word pairs;

the generating module comprises:

determining the error correction result according to the candidate text;

the preset standard comprises