CN111460795B - Text error correction method and system - Google Patents

Info

Publication number
CN111460795B
Authority
CN
China
Prior art keywords
confusion
entries
text
preset number
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010225790.9A
Other languages
Chinese (zh)
Other versions
CN111460795A (en)
Inventor
王博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN202010225790.9A priority Critical patent/CN111460795B/en
Publication of CN111460795A publication Critical patent/CN111460795A/en
Application granted granted Critical
Publication of CN111460795B publication Critical patent/CN111460795B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a text error correction method and system, wherein the method comprises the following steps: acquiring a text to be corrected; determining potentially erroneous entries according to the text to be corrected; obtaining a confusion set, wherein the confusion set comprises a plurality of confusion word pairs and the transition probabilities of the confusion word pairs; and generating an error correction result according to the potentially erroneous entries and the confusion set. With the technical scheme of the invention, the error correction result is more accurate.

Description

Text error correction method and system
Technical Field
The present invention relates to the field of information processing technologies, and in particular, to a text error correction method and system.
Background
When text error correction is performed through a language model, the transition probability of the confusion word pairs is set to a fixed value, so that low-frequency words are often wrongly corrected to high-frequency words. In some specialized fields, such as medical settings, this problem is especially pronounced. For example, in the phrase "administering a patient antibiotic therapy", the low-frequency word "administering" is miscorrected to the high-frequency word "and". This makes the error correction results inaccurate.
Disclosure of Invention
The invention provides a text error correction method and a text error correction system, wherein the technical scheme is as follows:
according to a first aspect of an embodiment of the present invention, there is provided a text error correction method, including:
acquiring a text to be corrected;
determining potential error entries according to the text to be corrected;
obtaining a confusion set, wherein the confusion set is provided with a plurality of confusion word pairs and transition probabilities of the confusion word pairs;
and generating an error correction result according to the potential error entry and the confusion set.
In one embodiment, the determining the potentially wrong entry according to the text to be corrected includes:
word segmentation is carried out on the text to be corrected to obtain a text to be corrected after word segmentation, wherein the text to be corrected after word segmentation has a first preset number of entries;
calculating the text to be corrected after word segmentation to obtain forward probabilities and reverse probabilities respectively corresponding to the first preset number of entries;
averaging the forward probabilities and the reverse probabilities which correspond to the first preset number of entries respectively to obtain average probabilities which correspond to the first preset number of entries respectively;
marking the entries with average probability smaller than a preset threshold value in the first preset number of entries to obtain a second preset number of entries, wherein the value of the second preset number is smaller than or equal to the first preset number;
and determining the second preset number of entries as the potential error entries, wherein the number of the potential error entries is the second preset number.
In one embodiment, the obtaining the confusion set includes:
acquiring corpus;
obtaining different confusion texts according to the corpus, wherein the different confusion texts are provided with the plurality of confusion word pairs;
counting the respective word frequencies of the plurality of confusion word pairs;
performing a calculation on the respective word frequencies of the plurality of confusion word pairs to obtain the transition probabilities of the plurality of confusion word pairs;
and determining the confusion set according to the different confusion texts and the transition probabilities of the plurality of confusion word pairs.
In one embodiment, the generating an error correction result from the potentially erroneous entry and the confusion set includes:
inquiring the confusion set based on the potential wrong entry to determine a candidate confusion word corresponding to the potential wrong entry and the transition probability of the potential wrong entry and the candidate confusion word, wherein the number of the candidate confusion words is a second preset number;
deleting the candidate confusion words which do not accord with the preset standard in the confusion set to obtain a deleted confusion set;
carrying out probability calculation on the deleted confusion set, and taking the value with the maximum probability in the confusion set as a candidate text;
and determining the error correction result according to the candidate text.
In one embodiment, the preset criteria include:
counting the n-tuples at the position corresponding to the potentially erroneous entry;
and after the candidate confusion word replaces the potentially erroneous entry, the probability of the n-tuples is not reduced.
According to a second aspect of an embodiment of the present invention, there is provided a text error correction system, including:
the first acquisition module is used for acquiring the text to be corrected;
the determining module is used for determining potential error entries according to the text to be corrected;
the second acquisition module is used for acquiring a confusion set, wherein the confusion set is provided with a plurality of confusion word pairs and transition probabilities of the confusion word pairs;
and the generating module is used for generating an error correction result according to the potential error entry and the confusion set.
In one embodiment, the determining module includes:
the word segmentation sub-module is used for segmenting the text to be corrected to obtain segmented text to be corrected, wherein the segmented text to be corrected is provided with a first preset number of entries;
the first calculation sub-module is used for calculating the text to be corrected after word segmentation to obtain forward probabilities and reverse probabilities respectively corresponding to the first preset number of entries;
the averaging sub-module is used for carrying out averaging processing on the forward probabilities and the reverse probabilities which correspond to the first preset number of entries respectively so as to obtain average probabilities which correspond to the first preset number of entries respectively;
the marking sub-module is used for marking the entries with average probability smaller than a preset threshold value in the first preset number of entries to obtain a second preset number of entries, wherein the value of the second preset number is smaller than or equal to the first preset number;
the first determining submodule is used for determining that the second preset number of entries are the potential error entries, wherein the number of the potential error entries is the second preset number.
In one embodiment, the second acquisition module includes:
the first acquisition sub-module is used for acquiring corpus;
the second obtaining submodule is used for obtaining different confusion texts according to the corpus, wherein the different confusion texts are provided with the plurality of confusion word pairs;
the statistics sub-module is used for counting the word frequencies of the plurality of confusion word pairs;
the second calculation sub-module is used for calculating the respective word frequency according to the plurality of confusion word pairs so as to obtain the transition probability of the plurality of confusion word pairs;
and the second determining submodule is used for determining the confusion set according to the different confusion texts and the transition probabilities of the confusion word pairs.
In one embodiment, the generating module includes:
the query sub-module is used for querying the confusion set based on the potential wrong vocabulary entry to determine candidate confusion words corresponding to the potential wrong vocabulary entry and the transition probability of the potential wrong vocabulary entry and the candidate confusion words, wherein the number of the candidate confusion words is a second preset number;
a deleting sub-module, configured to delete candidate confusion words in the confusion set that do not meet a preset standard, so as to obtain a deleted confusion set;
a third calculation sub-module, configured to perform probability calculation on the deleted confusion set, and use a value with the maximum probability in the confusion set as a candidate text;
and determining the error correction result according to the candidate text.
In one embodiment, the preset criteria include:
counting the n-tuples at the position corresponding to the potentially erroneous entry;
and after the candidate confusion word replaces the potentially erroneous entry, the probability of the n-tuples is not reduced.
The technical scheme provided by the embodiment of the invention can comprise the following beneficial effects:
A text to be corrected is acquired; potentially erroneous entries are then determined according to the text to be corrected; a confusion set is then obtained; and an error correction result is generated according to the potentially erroneous entries and the confusion set. With this technical scheme, different confusion word pairs have different transition probabilities, so that low-frequency words are not wrongly corrected into high-frequency words during error correction, which improves the error correction accuracy and makes the error correction result more accurate.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description and drawings.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
FIG. 1 is a flow chart of a text error correction method according to an embodiment of the invention;
FIG. 2 is a flow chart of another text error correction method according to an embodiment of the invention;
FIG. 3 is a block diagram of a text error correction system in accordance with one embodiment of the present invention;
FIG. 4 is a block diagram of another text error correction system in accordance with an embodiment of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are for illustration and explanation of the present invention only, and are not intended to limit the present invention.
Fig. 1 is a flowchart of a text error correction method according to an embodiment of the present invention, and as shown in fig. 1, the method may be implemented as steps S11-S14:
in step S11, obtaining a text to be corrected;
in step S12, determining potentially erroneous entries according to the text to be corrected;
By way of example, if the text to be corrected is "I eat tomato stir-fried potatoes today", the potentially erroneous entry therein may be "potatoes".
In step S13, a confusion set is obtained, wherein the confusion set has a plurality of confusion word pairs and transition probabilities of the confusion word pairs;
By way of example, the confusion word pairs may include "potato" and "egg" as well as "potato" and "banana", where "egg" may be a low-frequency word in the confusion set and "banana" a high-frequency word. The transition probability refers to the probability that entry A is input as (corrected to) entry B, and may be explained by the following formula. For example, if the text $Y$ consists of $y_1 y_2 y_3 \ldots y_n$ and the text $X$ consists of $x_1 x_2 x_3 \ldots x_n$, then:
[Formula image GDA0002499611220000051 not reproduced]
where $p(x_i \mid y_i)$ is the transition probability from entry $y_i$ to $x_i$, which is equivalent to the probability that $y_i$ is input as $x_i$.
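The formula image itself is not available in this text. Given the definitions of $X$, $Y$ and $p(x_i \mid y_i)$ above, one plausible reading (an assumption, not confirmed by the source) is that the entries are treated independently, so that the text-level transition probability factorizes over positions:

```latex
% Assumed reconstruction: per-entry transition probabilities multiplied over all n positions.
P(X \mid Y) = \prod_{i=1}^{n} p(x_i \mid y_i)
```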
In step S14, an error correction result is generated according to the potentially erroneous entry and the confusion set. Error correction is carried out according to the transition probabilities in the confusion set and the confusion word pairs; for example, the error correction result generated for "I eat tomato stir-fried potatoes today" may be "I eat tomato stir-fried eggs today".
A text to be corrected is acquired; potentially erroneous entries are then determined according to the text to be corrected; a confusion set is then obtained; and an error correction result is generated according to the potentially erroneous entries and the confusion set. With this technical scheme, different confusion word pairs have different transition probabilities, so that low-frequency words are not wrongly corrected into high-frequency words during error correction, which improves the error correction accuracy and makes the error correction result more accurate.
As shown in fig. 2, in one embodiment, the above step S12 may be implemented as the following steps S121-S125:
in step S121, word segmentation is performed on the text to be corrected to obtain a text to be corrected after word segmentation, wherein the text to be corrected after word segmentation has a first preset number of entries;
In step S122, the text to be corrected after word segmentation is calculated to obtain forward probabilities and reverse probabilities respectively corresponding to the first preset number of entries. Specifically, the segmented text is scored by a language model to obtain the forward probability and the reverse probability, where the forward probability reflects the correlation between the current entry and the entry preceding it, and the reverse probability reflects the correlation between the current entry and the entry following it.
In step S123, the forward probabilities and the reverse probabilities corresponding to the first preset number of entries are averaged to obtain average probabilities corresponding to the first preset number of entries;
in step S124, the entries with average probability smaller than the preset threshold value in the first preset number of entries are marked to obtain a second preset number of entries, where the value of the second preset number is smaller than or equal to the first preset number;
in step S125, a second predetermined number of entries is determined as potentially erroneous entries, wherein the number of potentially erroneous entries is the second predetermined number.
First, word segmentation is performed on the text to be corrected to obtain the segmented text. Second, the segmented text is calculated to obtain the forward probabilities and reverse probabilities respectively corresponding to the first preset number of entries. The forward and reverse probabilities of each entry are then averaged to obtain the average probability corresponding to each of the first preset number of entries. Next, the entries whose average probability is smaller than the preset threshold are marked, yielding the second preset number of entries. Finally, the second preset number of entries are determined to be the potentially erroneous entries. With this technical scheme, correct entries are less likely to be misjudged as erroneous, and the potentially erroneous entries can be identified accurately.
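Steps S121 to S125 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `find_potential_errors`, the `Scorer` type, the toy probability tables and the threshold value 0.2 are all assumptions introduced here, and the toy scorers merely stand in for a real language model.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical scorer type: given the segmented entries and an index,
# return a probability in [0, 1] for the entry at that index.
Scorer = Callable[[Sequence[str], int], float]


def find_potential_errors(entries: Sequence[str],
                          forward_prob: Scorer,
                          reverse_prob: Scorer,
                          threshold: float) -> List[int]:
    """Return indices of entries whose averaged probability is below threshold."""
    flagged = []
    for i in range(len(entries)):
        # Average the forward and reverse probabilities of entry i (step S123).
        avg = (forward_prob(entries, i) + reverse_prob(entries, i)) / 2.0
        # Mark entries whose average falls below the threshold (step S124).
        if avg < threshold:
            flagged.append(i)
    return flagged


# Toy stand-ins for a language model (assumed values, for illustration only).
FORWARD: Dict[Tuple[str, str], float] = {("stir-fried", "potatoes"): 0.05}
REVERSE: Dict[Tuple[str, str], float] = {("potatoes", "today"): 0.10}
DEFAULT = 0.5


def toy_forward(entries: Sequence[str], i: int) -> float:
    # Relates the current entry to the entry before it.
    if i == 0:
        return DEFAULT
    return FORWARD.get((entries[i - 1], entries[i]), DEFAULT)


def toy_reverse(entries: Sequence[str], i: int) -> float:
    # Relates the current entry to the entry after it.
    if i == len(entries) - 1:
        return DEFAULT
    return REVERSE.get((entries[i], entries[i + 1]), DEFAULT)


entries = ["I", "eat", "tomato", "stir-fried", "potatoes", "today"]
flagged = find_potential_errors(entries, toy_forward, toy_reverse, threshold=0.2)
print([entries[i] for i in flagged])
```

Returning indices rather than words keeps the position of each flagged entry, which the later replacement and n-tuple steps need.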
In one embodiment, the obtaining the confusion set includes:
acquiring corpus;
obtaining different confusion texts according to the corpus, wherein the different confusion texts are provided with the plurality of confusion word pairs;
counting the respective word frequencies of the plurality of confusion word pairs;
performing a calculation on the respective word frequencies of the plurality of confusion word pairs to obtain the transition probabilities of the plurality of confusion word pairs;
The transition probabilities between different confusion word pairs are calculated with the user's input habits taken into account. When people input text, the cost of inputting a low-frequency word is often greater than the cost of inputting a high-frequency word, so most errors in text input consist of a low-frequency word being wrongly input as a high-frequency word, while a high-frequency word is rarely wrongly input as a low-frequency word. Therefore, according to word frequency, the transition probability from high-frequency words to low-frequency words is reduced, and the transition probability from low-frequency words to high-frequency words is increased. Specifically, the transition probability from entry y to confusion word x is calculated using the following formula:
[Formula image GDA0002499611220000071 not reproduced]
where $P_C$ is the reference transition probability of the confusion word pair, α is an adjustment factor that controls the influence of word frequency on the final transition probability, freq(x) is the word frequency of entry x, and freq(y) is the word frequency of entry y.
And determining the confusion set according to the different confusion texts and the transition probabilities of the plurality of confusion word pairs.
With this method, the transition probabilities of the plurality of confusion word pairs are obtained by calculation on the respective word frequencies of the confusion word pairs, so that the error correction result is more accurate when text error correction is performed.
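The exact formula is contained in an image that is not available in this text, so the sketch below uses a stand-in of its own. It assumes, purely for illustration, that the reference probability is scaled by the frequency ratio raised to the adjustment factor, which at least matches the stated behaviour (low-to-high transitions increase, high-to-low transitions decrease). The function names, the cap at 1.0 and all numeric values are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def transition_probability(base_prob: float, freq_x: float,
                           freq_y: float, alpha: float) -> float:
    """Hypothetical adjustment (NOT the patent's formula):
    scale the reference probability P_C by (freq(x) / freq(y)) ** alpha."""
    if freq_x <= 0 or freq_y <= 0:
        raise ValueError("word frequencies must be positive")
    # Cap at 1.0 so the result remains a valid probability.
    return min(1.0, base_prob * (freq_x / freq_y) ** alpha)


def build_confusion_set(pairs: Iterable[Tuple[str, str]],
                        freq: Dict[str, float],
                        base_prob: float,
                        alpha: float) -> Dict[Tuple[str, str], float]:
    """Map each (y, x) confusion word pair to its transition probability y -> x."""
    return {
        (y, x): transition_probability(base_prob, freq[x], freq[y], alpha)
        for y, x in pairs
    }


# Assumed toy word frequencies: "egg" is rarer and "banana" more common than "potato".
freq = {"potato": 300, "egg": 50, "banana": 900}
confusion_set = build_confusion_set(
    [("potato", "egg"), ("potato", "banana")], freq, base_prob=0.01, alpha=0.5
)
for pair, prob in confusion_set.items():
    print(pair, round(prob, 5))
```

With α set to 0 the frequency term vanishes and every pair keeps the fixed reference probability, which corresponds to the fixed-value behaviour that the Background section criticizes.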
In one embodiment, the generating an error correction result from the potentially erroneous entry and the confusion set includes:
inquiring the confusion set based on the potential wrong entry to determine a candidate confusion word corresponding to the potential wrong entry and the transition probability of the potential wrong entry and the candidate confusion word, wherein the number of the candidate confusion words is a second preset number;
deleting the candidate confusion words which do not accord with the preset standard in the confusion set to obtain a deleted confusion set;
performing probability calculation on the deleted confusion set, and taking the value with the maximum probability in the confusion set as a candidate text. Here, the language model serves to calculate the probability of a sentence in order to judge whether the sentence is fluent. The probability calculation on the deleted confusion set is performed through the language model. For example, if the deleted confusion set comprises "I eat tomato stir-fried eggplant today" and "I eat tomato stir-fried egg today", and the calculated probability of the latter is larger than that of the former, the candidate text is "I eat tomato stir-fried egg today".
And determining the error correction result according to the candidate text.
Through the technical scheme, the error correction result can be accurately obtained.
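The selection of the candidate text can be sketched as follows. The source only states that the value with the maximum probability is chosen. Multiplying the sentence probability by the transition probability is an assumption made here, and `best_candidate`, the toy sentence scorer and all numbers are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def best_candidate(entries: Sequence[str],
                   position: int,
                   candidates: Dict[str, float],
                   sentence_prob: Callable[[Sequence[str]], float]
                   ) -> Tuple[List[str], float]:
    """Substitute each candidate at `position` and keep the highest-scoring text.

    `candidates` maps a candidate confusion word to its transition probability.
    If `candidates` is empty, the original text is returned with score -inf.
    """
    best_text, best_score = list(entries), float("-inf")
    for word, trans_p in candidates.items():
        text = list(entries)
        text[position] = word
        # Assumed scoring: language-model sentence probability times
        # the transition probability of the confusion word pair.
        score = sentence_prob(text) * trans_p
        if score > best_score:
            best_text, best_score = text, score
    return best_text, best_score


# Toy sentence scorer standing in for a language model (assumed values).
TOY_LM = {"eggplant": 0.001, "egg": 0.01}


def toy_sentence_prob(text: Sequence[str]) -> float:
    return TOY_LM.get(text[4], 1e-6)


entries = ["I", "eat", "tomato", "stir-fried", "potatoes", "today"]
text, score = best_candidate(entries, 4, {"eggplant": 0.2, "egg": 0.1},
                             toy_sentence_prob)
print(" ".join(text))
```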
In one embodiment, the preset criteria include:
counting the n-tuples at the position corresponding to the potentially erroneous entry;
For example, the text is $x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8$, the potentially erroneous entry is $x_4$, and the candidate confusion entry is $y_4$; all n-tuples at the position corresponding to $x_4$ are counted (where $4 \ge n \ge 3$).
and after the candidate confusion word replaces the potentially erroneous entry, the probability of the n-tuples is not reduced.
Illustratively, using an N-gram model, the n-tuple probabilities are calculated before and after the substitution of the confusion entry; after $x_4$ is replaced by $y_4$, the probability of the n-tuples in the context of that position is not reduced.
With this technical scheme, after the potentially erroneous entry is replaced, the contextual coherence of the text at the replaced position does not degrade, thereby reducing miscorrection.
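The n-tuple criterion can be sketched as follows. The sketch assumes that "not reduced" is checked for every n-tuple (n = 3 and n = 4) covering the replaced position, which is one possible reading of the text. The helper names and the toy n-gram scorer are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def ngrams_covering(entries: Sequence[str], position: int,
                    n_values: Sequence[int] = (3, 4)) -> List[Tuple[int, int]]:
    """Return (start, end) spans of all n-tuples that contain `position`."""
    spans = []
    for n in n_values:
        first = max(0, position - n + 1)
        last = min(position, len(entries) - n)
        for start in range(first, last + 1):
            spans.append((start, start + n))
    return spans


def keeps_context(entries: Sequence[str], position: int, candidate: str,
                  ngram_prob: Callable[[Tuple[str, ...]], float],
                  n_values: Sequence[int] = (3, 4)) -> bool:
    """True if no covering n-tuple becomes less probable after the replacement."""
    replaced = list(entries)
    replaced[position] = candidate
    for start, end in ngrams_covering(entries, position, n_values):
        before = ngram_prob(tuple(entries[start:end]))
        after = ngram_prob(tuple(replaced[start:end]))
        if after < before:
            return False
    return True


def toy_ngram_prob(ngram: Tuple[str, ...]) -> float:
    # Assumed toy scores: "y4" fits the context better, "z4" worse.
    if "y4" in ngram:
        return 0.3
    if "z4" in ngram:
        return 0.1
    return 0.2


text = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]
print(len(ngrams_covering(text, 3)))             # spans covering x4
print(keeps_context(text, 3, "y4", toy_ngram_prob))
print(keeps_context(text, 3, "z4", toy_ngram_prob))
```

Candidates for which `keeps_context` returns False would be deleted from the confusion set before the probability calculation.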
For the above-mentioned text error correction method provided by the embodiment of the present invention, the embodiment of the present invention further provides a text error correction system, as shown in fig. 3, where the system includes:
a first obtaining module 31, configured to obtain a text to be corrected;
a determining module 32, configured to determine a potentially erroneous entry according to the text to be corrected;
a second obtaining module 33, configured to obtain a confusion set, where the confusion set has a plurality of confusion word pairs and transition probabilities of the plurality of confusion word pairs;
and the generating module 34 is configured to generate an error correction result according to the potentially erroneous entry and the confusion set.
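The four modules can be wired together as in the following sketch. The class name `TextCorrectionSystem`, its field names, the toy callables and the probabilities are assumptions made for illustration. The source defines only the responsibilities of modules 31 to 34, not their interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical representation: (wrong entry, candidate) -> transition probability.
ConfusionSet = Dict[Tuple[str, str], float]


@dataclass
class TextCorrectionSystem:
    acquire_text: Callable[[], str]                      # first obtaining module 31
    determine_errors: Callable[[str], List[str]]         # determining module 32
    acquire_confusion_set: Callable[[], ConfusionSet]    # second obtaining module 33
    generate_result: Callable[[str, List[str], ConfusionSet], str]  # generating module 34

    def run(self) -> str:
        text = self.acquire_text()
        errors = self.determine_errors(text)
        confusion = self.acquire_confusion_set()
        return self.generate_result(text, errors, confusion)


def toy_generate(text: str, errors: List[str], confusion: ConfusionSet) -> str:
    # Toy generator: replace each flagged entry with its most probable candidate.
    for wrong in errors:
        options = {x: p for (y, x), p in confusion.items() if y == wrong}
        if options:
            text = text.replace(wrong, max(options, key=options.get))
    return text


system = TextCorrectionSystem(
    acquire_text=lambda: "I eat tomato stir-fried potatoes today",
    determine_errors=lambda text: ["potatoes"],
    acquire_confusion_set=lambda: {("potatoes", "eggs"): 0.6,
                                   ("potatoes", "bananas"): 0.4},
    generate_result=toy_generate,
)
print(system.run())
```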
As shown in fig. 4, in one embodiment, the determining module 32 includes:
the word segmentation sub-module 321 is configured to segment the text to be corrected to obtain a text to be corrected after word segmentation, where the text to be corrected after word segmentation has a first preset number of entries;
a first calculation sub-module 322, configured to calculate the text to be corrected after word segmentation, so as to obtain a forward probability and a reverse probability that correspond to the first preset number of terms respectively;
an averaging submodule 323, configured to average the forward probabilities and the reverse probabilities corresponding to the first preset number of entries respectively, so as to obtain average probabilities corresponding to the first preset number of entries respectively;
a marking sub-module 324, configured to mark an entry with an average probability smaller than a preset threshold value in the first preset number of entries, so as to obtain a second preset number of entries, where a value of the second preset number is smaller than or equal to the first preset number;
a first determining sub-module 325, configured to determine the second preset number of terms as the potentially wrong terms, where the number of potentially wrong terms is the second preset number.
In one embodiment, the second acquisition module includes:
the first acquisition sub-module is used for acquiring corpus;
the second obtaining submodule is used for obtaining different confusion texts according to the corpus, wherein the different confusion texts are provided with the plurality of confusion word pairs;
the statistics sub-module is used for counting the word frequencies of the plurality of confusion word pairs;
the second calculation sub-module is used for calculating the respective word frequency according to the plurality of confusion word pairs so as to obtain the transition probability of the plurality of confusion word pairs;
and the second determining submodule is used for determining the confusion set according to the different confusion texts and the transition probabilities of the confusion word pairs.
In one embodiment, the generating module includes:
the query sub-module is used for querying the confusion set based on the potential wrong vocabulary entry to determine candidate confusion words corresponding to the potential wrong vocabulary entry and the transition probability of the potential wrong vocabulary entry and the candidate confusion words, wherein the number of the candidate confusion words is a second preset number;
a deleting sub-module, configured to delete candidate confusion words in the confusion set that do not meet a preset standard, so as to obtain a deleted confusion set;
a third calculation sub-module, configured to perform probability calculation on the deleted confusion set, and use a value with the maximum probability in the confusion set as a candidate text;
and determining the error correction result according to the candidate text.
In one embodiment, the preset criteria include:
counting the n-tuples at the position corresponding to the potentially erroneous entry;
and after the candidate confusion word replaces the potentially erroneous entry, the probability of the n-tuples is not reduced.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (2)

1. A method for text correction, comprising:
acquiring a text to be corrected;
determining potential error entries according to the text to be corrected;
obtaining a confusion set, wherein the confusion set is provided with a plurality of confusion word pairs and transition probabilities of the confusion word pairs;
generating an error correction result according to the potential error entry and the confusion set;
the determining the potential error entry according to the text to be corrected comprises the following steps:
word segmentation is carried out on the text to be corrected to obtain a text to be corrected after word segmentation, wherein the text to be corrected after word segmentation has a first preset number of entries;
calculating the text to be corrected after word segmentation to obtain forward probabilities and reverse probabilities respectively corresponding to the first preset number of entries;
averaging the forward probabilities and the reverse probabilities which correspond to the first preset number of entries respectively to obtain average probabilities which correspond to the first preset number of entries respectively;
marking the entries with average probability smaller than a preset threshold value in the first preset number of entries to obtain a second preset number of entries, wherein the value of the second preset number is smaller than or equal to the first preset number;
determining the second preset number of entries as the potential error entries, wherein the number of the potential error entries is the second preset number;
the obtaining the confusion set comprises:
acquiring corpus;
obtaining different confusion texts according to the corpus, wherein the different confusion texts are provided with the plurality of confusion word pairs;
counting the respective word frequencies of the plurality of confusion word pairs;
performing a calculation on the respective word frequencies of the plurality of confusion word pairs to obtain the transition probabilities of the plurality of confusion word pairs;
determining the confusion set according to the different confusion texts and the transition probabilities of the plurality of confusion word pairs;
wherein generating the error correction result according to the potential error entry and the confusion set comprises:
querying the confusion set based on the potential error entry to determine candidate confusion words corresponding to the potential error entry and the transition probability between the potential error entry and each candidate confusion word, wherein the number of the candidate confusion words is a second preset number;
deleting, from the confusion set, the candidate confusion words that do not meet a preset standard to obtain a confusion set after deletion;
performing probability calculation on the confusion set after deletion, and taking the value with the maximum probability in that confusion set as a candidate text;
determining the error correction result according to the candidate text;
wherein the preset standard comprises:
counting the n-tuples at the position corresponding to the potential error entry; and
after the candidate confusion word replaces the potential error entry, the probability of the n-tuples is not reduced.
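The detection and correction steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes a bigram model (n = 2), English tokens that are already segmented, hand-written toy probability tables (`FWD`, `BWD`), a floor value `FLOOR` for unseen bigrams, and a candidate score equal to the transition probability multiplied by the local bigram probability. All of these names and values are hypothetical, since the claim fixes neither the language model nor the exact scoring formula.

```python
from typing import Dict, List, Tuple

Bigram = Dict[Tuple[str, str], float]
FLOOR = 1e-4  # assumed probability for bigrams missing from the toy tables


def average_probs(entries: List[str], fwd: Bigram, bwd: Bigram) -> List[float]:
    """Mean of forward P(w_i | w_{i-1}) and reverse P(w_i | w_{i+1}) per entry."""
    padded = ["<s>"] + entries + ["</s>"]
    averages = []
    for i in range(1, len(padded) - 1):
        forward = fwd.get((padded[i - 1], padded[i]), FLOOR)
        # bwd is keyed as (next word, current word)
        reverse = bwd.get((padded[i + 1], padded[i]), FLOOR)
        averages.append((forward + reverse) / 2)
    return averages


def flag_entries(entries: List[str], fwd: Bigram, bwd: Bigram,
                 threshold: float) -> List[int]:
    """Indices of entries whose average probability is below the threshold."""
    averages = average_probs(entries, fwd, bwd)
    return [i for i, p in enumerate(averages) if p < threshold]


def window_prob(entries: List[str], i: int, fwd: Bigram) -> float:
    """Product of the two bigram probabilities that touch position i."""
    padded = ["<s>"] + entries + ["</s>"]
    j = i + 1
    return (fwd.get((padded[j - 1], padded[j]), FLOOR)
            * fwd.get((padded[j], padded[j + 1]), FLOOR))


def correct(entries: List[str], fwd: Bigram, bwd: Bigram,
            confusion: Dict[str, Dict[str, float]],
            threshold: float) -> List[str]:
    """Replace each flagged entry with its best surviving candidate, if any."""
    result = list(entries)
    for i in flag_entries(entries, fwd, bwd, threshold):
        base = window_prob(result, i, fwd)
        best, best_score = result[i], 0.0
        for cand, trans in confusion.get(result[i], {}).items():
            trial = result[:i] + [cand] + result[i + 1:]
            p = window_prob(trial, i, fwd)
            if p < base:  # preset standard: n-tuple probability must not drop
                continue
            score = trans * p  # assumed scoring rule
            if score > best_score:
                best, best_score = cand, score
        result[i] = best
    return result


# Toy data (hypothetical values chosen only for illustration).
FWD = {("<s>", "I"): 0.5, ("I", "want"): 0.4, ("want", "too"): 0.01,
       ("too", "eat"): 0.01, ("want", "to"): 0.5, ("to", "eat"): 0.3,
       ("eat", "</s>"): 0.4}
BWD = {("want", "I"): 0.5, ("too", "want"): 0.02, ("eat", "too"): 0.01,
       ("</s>", "eat"): 0.4}
CONFUSION = {"too": {"to": 0.7, "two": 0.3}}
TOKENS = ["I", "want", "too", "eat"]

flagged = flag_entries(TOKENS, FWD, BWD, threshold=0.1)
corrected = correct(TOKENS, FWD, BWD, CONFUSION, threshold=0.1)
```

With these toy tables only the entry "too" falls below the threshold. The candidate "two" is discarded because it lowers the local bigram probability, and "to" is selected as the replacement.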
2. A text error correction system, comprising:
the first acquisition module is used for acquiring the text to be corrected;
the determining module is used for determining potential error entries according to the text to be corrected;
the second acquisition module is used for acquiring a confusion set, wherein the confusion set comprises a plurality of confusion word pairs and transition probabilities of the confusion word pairs;
the generating module is used for generating an error correction result according to the potential error entry and the confusion set;
the determining module includes:
the word segmentation sub-module is used for segmenting the text to be corrected to obtain a segmented text to be corrected, wherein the segmented text to be corrected has a first preset number of entries;
the first calculation sub-module is used for calculating, from the segmented text to be corrected, a forward probability and a reverse probability for each of the first preset number of entries;
the averaging sub-module is used for averaging the forward probability and the reverse probability of each entry to obtain an average probability for each of the first preset number of entries;
the marking sub-module is used for marking the entries whose average probability is smaller than a preset threshold among the first preset number of entries to obtain a second preset number of entries, wherein the second preset number is smaller than or equal to the first preset number;
the first determining sub-module is used for determining the second preset number of entries as the potential error entries, wherein the number of the potential error entries is the second preset number;
the first acquisition module includes:
the first acquisition sub-module is used for acquiring a corpus;
the second acquisition sub-module is used for obtaining different confusion texts according to the corpus, wherein the different confusion texts contain the plurality of confusion word pairs;
the statistics sub-module is used for counting the word frequency of each of the plurality of confusion word pairs;
the second calculation sub-module is used for calculating the transition probabilities of the plurality of confusion word pairs from their respective word frequencies;
a second determining sub-module, configured to determine the confusion set according to the different confusion texts and transition probabilities of the plurality of confusion word pairs;
the generating module comprises:
the query sub-module is used for querying the confusion set based on the potential error entry to determine candidate confusion words corresponding to the potential error entry and the transition probability between the potential error entry and each candidate confusion word, wherein the number of the candidate confusion words is a second preset number;
a deleting sub-module, configured to delete, from the confusion set, the candidate confusion words that do not meet a preset standard, so as to obtain a confusion set after deletion;
a third calculation sub-module, configured to perform probability calculation on the confusion set after deletion, and use the value with the maximum probability in that confusion set as a candidate text;
determining the error correction result according to the candidate text;
wherein the preset standard comprises:
counting the n-tuples at the position corresponding to the potential error entry; and
after the candidate confusion word replaces the potential error entry, the probability of the n-tuples is not reduced.
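The confusion-set construction recited in the claims (counting the word frequency of each confusion word pair and deriving transition probabilities from those frequencies) can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the corpus is represented as a hypothetical list of already extracted (wrong, right) pairs, and the transition probability is assumed to be the pair frequency normalised over all pairs that share the same wrong word. The claims do not spell out the exact formula, and the function name `build_confusion_set` is invented for this example.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def build_confusion_set(
    pairs: Iterable[Tuple[str, str]]
) -> Dict[str, Dict[str, float]]:
    """Map each wrong word to {right word: transition probability}."""
    pair_freq = Counter(pairs)  # word frequency of each confusion word pair

    # Total number of observations for each wrong word (assumed normaliser).
    totals: Counter = Counter()
    for (wrong, _right), count in pair_freq.items():
        totals[wrong] += count

    confusion: Dict[str, Dict[str, float]] = defaultdict(dict)
    for (wrong, right), count in pair_freq.items():
        confusion[wrong][right] = count / totals[wrong]
    return dict(confusion)


# Hypothetical confusion word pairs mined from a corpus.
observed = [("too", "to"), ("too", "to"), ("too", "to"),
            ("too", "two"), ("there", "their")]
confusion_set = build_confusion_set(observed)
```

Normalising per wrong word makes the probabilities of all candidates for a given entry sum to one, which keeps candidates for different entries comparable when they are later ranked.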
CN202010225790.9A 2020-03-26 2020-03-26 Text error correction method and system Active CN111460795B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010225790.9A CN111460795B (en) 2020-03-26 2020-03-26 Text error correction method and system

Publications (2)

Publication Number Publication Date
CN111460795A CN111460795A (en) 2020-07-28
CN111460795B CN111460795B (en) 2023-05-26

Family

ID=71683489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010225790.9A Active CN111460795B (en) 2020-03-26 2020-03-26 Text error correction method and system

Country Status (1)

Country Link
CN (1) CN111460795B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112560450B (en) * 2020-12-11 2024-02-13 科大讯飞股份有限公司 Text error correction method and device
CN112528980B (en) * 2020-12-16 2022-02-15 北京华宇信息技术有限公司 OCR recognition result correction method and terminal and system thereof

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106873799A (en) * 2017-02-16 2017-06-20 北京百度网讯科技有限公司 Input method and device
WO2018120889A1 (en) * 2016-12-28 2018-07-05 平安科技(深圳)有限公司 Input sentence error correction method and device, electronic device, and medium
CN110210029A (en) * 2019-05-30 2019-09-06 浙江远传信息技术股份有限公司 Speech text error correction method, system, equipment and medium based on vertical field
CN110489760A (en) * 2019-09-17 2019-11-22 达而观信息科技(上海)有限公司 Automatic text proofreading method and device based on deep neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140032973A1 (en) * 2012-07-26 2014-01-30 James K. Baker Revocable Trust System and method for robust pattern analysis with detection and correction of errors

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
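段建勇 (Duan Jianyong); 关晓龙 (Guan Xiaolong). Research on a query error correction method combining statistics and features. 现代图书情报技术 (New Technology of Library and Information Service), 2016, (02), full text. *
骆卫华 (Luo Weihua), 罗振声 (Luo Zhensheng), 宫小瑾 (Gong Xiaojin). Research on automatic proofreading technology for Chinese text. 计算机研究与发展 (Journal of Computer Research and Development), 2004, (01), full text. *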

Also Published As

Publication number Publication date
CN111460795A (en) 2020-07-28

Similar Documents

Publication Publication Date Title
CN111460795B (en) Text error correction method and system
WO2021174723A1 (en) Training sample expansion method and apparatus, electronic device, and storage medium
JP2882569B2 (en) Document format recognition execution method and apparatus
WO2015176518A1 (en) Reply information recommending method and device
US8494835B2 (en) Post-editing apparatus and method for correcting translation errors
US20070299664A1 (en) Automatic Text Correction
CN108090043B (en) Error correction report processing method and device based on artificial intelligence and readable medium
CN110083819B (en) Spelling error correction method, device, medium and electronic equipment
CN112580324B (en) Text error correction method, device, electronic equipment and storage medium
CN110188353B (en) Text error correction method and device
CN107679564A (en) Sample data recommends method and its device
CN111191441A (en) Text error correction method, device and storage medium
CN108182182B (en) Method and device for matching documents in translation database and computer readable storage medium
CN110543637A (en) Chinese word segmentation method and device
CN115358217A (en) Method and device for correcting words and sentences, readable storage medium and computer program product
CN107909097B (en) Method and device for updating samples in sample library
CN117744633B (en) Text error correction method and device and electronic equipment
CN117033597A (en) Intelligent question-answering method based on large language model
CN109597745B (en) Abnormal data processing method and device
CN110008231B (en) MySQL data backtracking method and storage medium
CN111797614B (en) Text processing method and device
Ive et al. Deep copycat networks for text-to-text generation
Tamames Text detective: a rule-based system for gene annotation in biomedical texts
CN110929514A (en) Text proofreading method and device, computer readable storage medium and electronic equipment
CN112101776B (en) Crowd-sourced task work group determining method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant