CN113283233A

CN113283233A - Text error correction method and device, electronic equipment and storage medium

Info

Publication number: CN113283233A
Application number: CN202110598406.4A
Authority: CN
Inventors: 门玉玲
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2021-05-28
Filing date: 2021-05-28
Publication date: 2021-08-20
Anticipated expiration: 2041-05-28
Also published as: CN113283233B

Abstract

The application relates to the technical field of natural language processing, and in particular discloses a text error correction method, a text error correction device, electronic equipment, and a storage medium. The error correction method comprises the following steps: replacing a first character in a recognized text with a second character; adding an identifier to each second character obtained by replacing a first character in the recognized text; determining, among the identified second characters in the recognized text, a second character to be corrected according to the adjacent characters of the identified second characters; acquiring the features of the second character to be corrected; and replacing, among the identified second characters in the recognized text, those second characters that match the features of the second character to be corrected with the first character, to obtain the corrected recognized text. With this method, recognition errors in the recognized text can be corrected automatically, which greatly reduces the consumption of human resources, improves error correction efficiency, and ensures the accuracy of automatic error correction.

Description

Text error correction method and device, electronic equipment and storage medium

Technical Field

The invention relates to the technical field of natural language processing, in particular to a text error correction method, a text error correction device, electronic equipment and a storage medium.

Background

Optical Character Recognition (OCR) is the process of electronically scanning an input image and extracting the characters in it. It offers a higher input speed than manual typing, can save a large amount of human resources, and can be used for automatic recognition in various fields, such as license plate recognition, identification card recognition, and bank card recognition.

However, OCR can make fixed, recurring errors when recognizing certain characters, so the text produced by OCR requires manual error correction. For example, when Adobe Acrobat DC uses OCR to convert a PDF file into a Word file, "base" is always recognized as "crystal", "mountain base", and "basic" as "generation base". Therefore, when a large number of files are converted, proofreaders must correct the same errors repeatedly, which requires a great deal of manpower and makes error correction inefficient.

Disclosure of Invention

In order to solve the above problems in the prior art, embodiments of the present application provide a text error correction method, apparatus, electronic device, and storage medium, which can automatically identify and replace erroneous characters and automatically roll back places where a character was replaced in error, thereby reducing manpower consumption and improving error correction efficiency.

In a first aspect, an embodiment of the present application provides a text error correction method, including:

replacing a first character in the recognized text with a second character;

adding an identifier to each second character obtained by replacing a first character in the recognized text;

determining, among the identified second characters in the recognized text, a second character to be corrected according to the adjacent characters of the identified second characters;

acquiring the features of the second character to be corrected;

and replacing, among the identified second characters in the recognized text, those second characters that match the features of the second character to be corrected with the first character, to obtain the corrected recognized text.

In a second aspect, an embodiment of the present application provides a text correction apparatus, including:

a character replacing module, used for replacing a first character in the recognized text with a second character;

a character identification module, used for adding an identifier to each second character obtained by replacing a first character in the recognized text;

a character determining module, used for determining, among the identified second characters in the recognized text, a second character to be corrected according to the adjacent characters of the identified second characters;

a feature determining module, used for acquiring the features of the second character to be corrected;

wherein the character replacing module is also used for replacing, among the identified second characters in the recognized text, those second characters that match the features of the second character to be corrected with the first character, to obtain the corrected recognized text.

In a third aspect, an embodiment of the present application provides an electronic device, including: a processor coupled to a memory, the memory being used for storing a computer program, and the processor being used for executing the computer program stored in the memory to cause the electronic device to perform the method of the first aspect.

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having a computer program stored thereon, the computer program causing a computer to perform the method according to the first aspect.

In a fifth aspect, embodiments of the present application provide a computer program product comprising a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to perform the method according to the first aspect.

The implementation of the embodiment of the application has the following beneficial effects:

In the embodiment of the application, the first character to be modified is recognized automatically and the second character to replace it is determined automatically, so that the first character in the recognized text is changed into the second character. Then, by adding an identifier to each second character obtained by replacing a first character in the recognized text, the places where the replacement was made in error are determined and rolled back automatically, that is, each second character produced by an erroneous modification is replaced with the first character. Therefore, recognition errors in the recognized text can be corrected automatically without manual error correction, which greatly reduces the consumption of human resources, improves error correction efficiency, and ensures the accuracy of automatic error correction.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of a hardware structure of a text error correction apparatus according to an embodiment of the present application;

Fig. 2 is a schematic flowchart of a text error correction method according to an embodiment of the present application;

Fig. 3 is a schematic flowchart of a method for determining a second character from a plurality of candidate characters according to adjacent characters of a first character according to an embodiment of the present application;

Fig. 4 is a schematic flowchart of a method for matching each first word vector of a plurality of first word vectors with the word vectors of a plurality of template words in a preset lexicon to obtain a plurality of first matching results in one-to-one correspondence with the template words, and determining a second character according to the first matching results, according to an embodiment of the present application;

Fig. 5 is a schematic flowchart of a method for determining a fourth word among at least one third word according to an embodiment of the present application;

Fig. 6 is a schematic flowchart of a method for determining a second character to be corrected among the identified second characters in a recognized text according to the adjacent characters of the identified second characters, according to an embodiment of the present application;

Fig. 7 is a schematic flowchart of a method for obtaining features of a second character to be corrected according to an embodiment of the present application;

Fig. 8 is a block diagram of the functional modules of a text error correction apparatus according to an embodiment of the present application;

Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art without any inventive work based on the embodiments in the present application are within the scope of protection of the present application.

The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.

Reference herein to "an embodiment" means that a particular feature, result, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.

Referring to fig. 1, fig. 1 is a schematic diagram of a hardware structure of a text error correction apparatus according to an embodiment of the present disclosure. The text correction device 100 includes at least one processor 101, a communication line 102, a memory 103, and at least one communication interface 104.

In this embodiment, the processor 101 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of programs according to the present application.

The communication line 102 may include a path that carries information between the aforementioned components.

The communication interface 104 may be any transceiver or other device (e.g., an antenna) for communicating with other devices or communication networks, such as Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).

The memory 103 may be, but is not limited to, a read-only memory (ROM) or other type of static storage device that can store static information and instructions, a random access memory (RAM) or other type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.

In this embodiment, the memory 103 may be independent and connected to the processor 101 through the communication line 102. The memory 103 may also be integrated with the processor 101. The memory 103 provided in the embodiments of the present application may generally be nonvolatile. The memory 103 is used for storing computer-executable instructions for executing the scheme of the application, and the execution of these instructions is controlled by the processor 101. The processor 101 is configured to execute the computer-executable instructions stored in the memory 103, thereby implementing the methods provided in the embodiments of the present application described below.

In alternative embodiments, computer-executable instructions may also be referred to as application code, which is not specifically limited in this application.

In alternative embodiments, processor 101 may include one or more CPUs, such as CPU0 and CPU1 of FIG. 1.

In an alternative embodiment, the text correction apparatus 100 may include a plurality of processors, such as the processor 101 and the processor 107 in FIG. 1. Each of these processors may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. A processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).

In an alternative embodiment, if the text correction apparatus 100 is a server, the text correction apparatus 100 may further include an output device 105 and an input device 106. The output device 105 is in communication with the processor 101 and may display information in a variety of ways. For example, the output device 105 may be a liquid crystal display (LCD), a light-emitting diode (LED) display device, a cathode ray tube (CRT) display device, a projector, or the like. The input device 106 is in communication with the processor 101 and may receive user input in a variety of ways. For example, the input device 106 may be a mouse, a keyboard, a touch screen device, or a sensing device, among others.

The text correction apparatus 100 may be a general-purpose device or a special-purpose device. The embodiment of the present application does not limit the type of the text correction device 100.

Referring to fig. 2, fig. 2 is a schematic flowchart of a text error correction method according to an embodiment of the present disclosure. The text error correction method comprises the following steps:

201: The first character in the recognized text is replaced with the second character.

In the present embodiment, the first character refers to an erroneous character produced by a recognition error in the OCR recognition process, and the second character refers to the original correct character corresponding to that erroneous character. For example, if the original "base" is recognized as "generation base" during OCR recognition, then "generation" is the first character and the original correct "base" is the second character.

In this embodiment, before step 201 is executed, a first character is determined in the recognized text, and a second character is determined according to the first character. Illustratively, the embodiment provides a method for determining a first character in the recognized text, which is as follows:

The first character to be modified in the recognized text is determined according to a preset error table. The error table is a table that records characters whose recognition error rate in OCR recognition is greater than a first threshold. For example, in OCR recognition, "base" is often recognized as "thousand", "from" as "mountain," and "base" as "generation base", so characters such as "base" are common recognition-error characters and are recorded in the error table.

Based on this, the present embodiment further provides a method for determining a second character according to a first character, which is as follows:

First, a character group corresponding to the first character is determined according to the error table.

Since recognition errors are usually fixed in OCR recognition, in the present embodiment the error table records not only the characters whose recognition error rate during OCR recognition is greater than the first threshold but also the character group corresponding to each such character. The character group includes a plurality of candidate characters corresponding to the character, where the probability that each candidate character is recognized as the character during OCR recognition is greater than a second threshold.

For example, during OCR recognition, characters such as "base", "even", "it", and "plug" are often recognized erroneously. Thus, for the erroneous character "die", the corresponding character group is "base", "even", "it", "plug".
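As a non-limiting illustration, the error table and its character groups could be held in a simple mapping, sketched below in Python. The table contents, the use of English glosses as stand-ins for the original characters, and the name `find_first_characters` are assumptions made for this sketch and are not taken from the embodiment.

```python
# Hypothetical error table: misrecognized character -> character group
# (candidate correct characters). English glosses stand in for the
# original characters; the entries are illustrative only.
ERROR_TABLE = {
    "crystal": ["base", "even", "it", "plug"],
}


def find_first_characters(tokens, error_table):
    """Return the positions of characters that appear in the error table."""
    return [i for i, tok in enumerate(tokens) if tok in error_table]


tokens = ["of", "crystal", "foundation"]
positions = find_first_characters(tokens, ERROR_TABLE)
character_groups = [ERROR_TABLE[tokens[i]] for i in positions]
```

Keeping the character group next to each erroneous character means a single lookup yields both the first character and its candidate replacements.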

Then, a second character is determined among the plurality of candidate characters based on the adjacent characters of the first character.

For example, there is provided a method of determining a second character among a plurality of candidate characters based on adjacent characters of a first character, as shown in fig. 3, the method including:

301: A third character is obtained.

Illustratively, the third character is a character adjacent to the first character in the recognized text; that is, the context of the erroneous character (i.e., the first character) can determine which character in the character group (i.e., the second character) the erroneous character should be changed to. Specifically, for the sentence "the sewer pipe network is a crystal infrastructure of a city" in the recognition result, comparison against the error table shows that the first character is "crystal". The previous character "of" (the character adjacent to the left of the first character) and the next character "foundation" (the character adjacent to the right of the first character) are then each obtained as a third character.

302: The third character and the first character are combined in the order in which they appear in the recognized text to obtain a first word.

Illustratively, continuing the example "the sewer pipe network is a crystal infrastructure of a city", the first character is "crystal", and the third characters are the previous character "of" and the next character "foundation". Therefore, combining the third characters with the first character in their order in the recognized text yields the first words "of crystal" and "crystal foundation".

303: The first character in the first word is replaced with each of the candidate characters in turn to obtain a plurality of second words in one-to-one correspondence with the candidate characters.

For example, continuing the example "the sewer pipe network is a crystal infrastructure of a city", looking up the error table gives the character group ["base", "even", "it", "plug"] corresponding to the character "crystal". Replacing the character "crystal" in the first words with each character in the character group in turn gives the second words: "base", "very", "its", "plug", "base", "very base", "its base", and "plug base".
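Steps 301 to 303 can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the text is represented as a list of characters (English glosses stand in for the original characters), words are two-character tuples, and the function name `build_words` is hypothetical.

```python
def build_words(tokens, index, candidates):
    """Build first words and second words for the character at `index`.

    first_words: the first character paired with its left and right
    neighbours (the third characters), in text order.
    second_words: for each candidate, the same pairs with the first
    character replaced by that candidate.
    """
    first = tokens[index]
    first_words = []
    second_words = {c: [] for c in candidates}
    if index > 0:  # left neighbour exists
        left = tokens[index - 1]
        first_words.append((left, first))
        for c in candidates:
            second_words[c].append((left, c))
    if index + 1 < len(tokens):  # right neighbour exists
        right = tokens[index + 1]
        first_words.append((first, right))
        for c in candidates:
            second_words[c].append((c, right))
    return first_words, second_words


first_words, second_words = build_words(
    ["of", "crystal", "foundation"], 1, ["base", "even"]
)
```

Each candidate character therefore yields up to two second words, one with the left neighbour and one with the right neighbour.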

304: Word embedding processing is performed on each second word in the plurality of second words to obtain a plurality of first word vectors in one-to-one correspondence with the plurality of second words.

In an alternative embodiment, a method of determining the first word vector of a second word is provided, as follows: first, the word vector of each character in the second word is determined; then the word vectors of the characters are concatenated according to the position of each character in the second word. Specifically, for the two-character second word "basis", the word vector A of its first (left) character and the word vector B of its second (right) character are first determined, and then the positional relationship between the two characters is determined. Since the first character is located to the left of the second character, the concatenation result is C = [A, B]. Finally, the concatenation result is taken as the first word vector of the second word.
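The concatenation C = [A, B] can be sketched as follows. The two-dimensional character vectors and the name `embed_word` are placeholders assumed for illustration; in practice the character vectors would come from whatever embedding model the implementation uses.

```python
def embed_word(chars, char_vectors):
    """Concatenate the per-character vectors in left-to-right order."""
    vector = []
    for ch in chars:
        vector.extend(char_vectors[ch])
    return vector


# Toy 2-dimensional character vectors (illustrative values only).
CHAR_VECTORS = {"base": [0.1, 0.9], "foundation": [0.7, 0.2]}

C = embed_word(("base", "foundation"), CHAR_VECTORS)
```

Because the vectors are joined in text order, swapping the two characters produces a different word vector.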

305: Each first word vector in the plurality of first word vectors is matched against the word vectors of a plurality of template words in a preset lexicon to obtain a plurality of first matching results in one-to-one correspondence with the plurality of template words, and the second character is determined according to the plurality of first matching results.

In this embodiment, a method is provided for matching each first word vector in the plurality of first word vectors against the word vectors of the plurality of template words in the preset lexicon to obtain the plurality of first matching results, and for determining the second character according to the plurality of first matching results. As shown in fig. 4, the method includes:

401: A first similarity is calculated between each first word vector and the word vector of each template word among the plurality of template words in the preset lexicon, to obtain a plurality of first similarities in one-to-one correspondence with the plurality of template words.

In the present embodiment, the similarity is used as the matching result between each first word vector and the word vectors of the plurality of template words in the preset lexicon. The preset lexicon stores a plurality of common template words that have actual semantics, so calculating the similarity between each first word vector and the word vector of each template word in the preset lexicon makes it possible to determine quickly whether each first word vector corresponds to a word with actual semantics.

For example, the cosine of the included angle between each first word vector and the word vector of each template word may be calculated and used as the similarity between that first word vector and the word vector of that template word.

Specifically, let each first word vector be A = [a1, a2, …, ai, …, an] and the word vector of each template word be B = [b1, b2, …, bi, …, bn], where i = 1, 2, …, n.

Based on this, the cosine of the included angle can be expressed by formula ②:

$$\cos\theta = \frac{A \cdot B}{|A|\,|B|} \qquad ②$$

where A · B represents the inner product of each first word vector A and the word vector B of each template word, | | is the modulus symbol, |A| represents the modulus of each first word vector A, and |B| represents the modulus of the word vector B of each template word.

Further, the inner product of each first word vector A and the word vector B of each template word can be expressed by formula ③:

$$A \cdot B = \sum_{i=1}^{n} a_i b_i \qquad ③$$

Further, the modulus of each first word vector A can be expressed by formula ④:

$$|A| = \sqrt{\sum_{i=1}^{n} a_i^{2}} \qquad ④$$

Finally, the cosine of the included angle is taken as the similarity between each first word vector and the word vector of each template word. For example, this similarity d may be expressed by formula ⑤:

$$d = \cos\theta \qquad ⑤$$

Because the cosine value lies in the range [-1, 1], it retains, even in high-dimensional spaces, the properties of being 1 when two vectors point in the same direction, 0 when they are orthogonal, and -1 when they point in opposite directions. That is, the closer the cosine value is to 1, the closer the directions of the two vectors; the closer it is to -1, the more opposite their directions; and a value close to 0 indicates that the two vectors are nearly orthogonal. The cosine value thus reflects the relative difference in direction between the two vectors. Therefore, using the cosine value as the similarity between each first word vector and the word vector of each template word represents that similarity accurately.
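The cosine similarity described above can be computed as in the following sketch. The function name `cosine_similarity` and the choice to return 0.0 for a zero-length vector are assumptions of this sketch; the embodiment does not specify how a zero vector is handled.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the included angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))      # inner product A . B
    norm_a = math.sqrt(sum(x * x for x in a))   # modulus |A|
    norm_b = math.sqrt(sum(y * y for y in b))   # modulus |B|
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for a zero vector
    return dot / (norm_a * norm_b)


same = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

The three sample calls correspond to the same-direction, orthogonal, and opposite-direction cases discussed above.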

402: The largest of the plurality of first similarities is taken as the first score of each first word vector, to obtain a plurality of first scores in one-to-one correspondence with the plurality of first word vectors.

403: Each second word whose first word vector has a first score greater than a third threshold is taken as a third word, to obtain at least one third word.

Therefore, the words with actual semantics in the second words can be screened out.
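Steps 401 to 403 can be sketched as a score-and-filter pass. The toy vectors, the threshold value 0.9, and the function names below are assumptions made for illustration only.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def first_score(word_vector, template_vectors):
    """Step 402: the largest similarity to any template word."""
    return max(cosine(word_vector, t) for t in template_vectors)


def select_third_words(second_words, template_vectors, third_threshold):
    """Step 403: keep second words whose first score exceeds the threshold."""
    return [
        word
        for word, vec in second_words.items()
        if first_score(vec, template_vectors) > third_threshold
    ]


TEMPLATES = [[1.0, 0.0], [0.0, 1.0]]    # toy template word vectors
SECOND_WORDS = {
    "base foundation": [0.9, 0.1],      # close to the first template
    "plug foundation": [0.7, 0.7],      # about 0.707 to either template
}
third_words = select_third_words(SECOND_WORDS, TEMPLATES, third_threshold=0.9)
```

With these toy values only the first second word clears the threshold, so it alone is kept as a third word.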

404: A fourth word is determined among the at least one third word, and the character in the fourth word that is at the same position as the first character is taken as the second character.

In this embodiment, if only one third word is obtained after the screening in step 403, that third word can be directly determined to be the fourth word. If a plurality of third words are obtained, the degree of matching between each third word and the recognized text can be determined, and the third word with the highest degree of matching is taken as the fourth word.

For example, in the present embodiment, a method for determining a fourth word in at least one third word is provided, as shown in fig. 5, the method includes:

501: The sentence in which the first word is located is obtained from the recognized text as a first sentence.

502: For each third word in the at least one third word, the first word in the first sentence is replaced with that third word, to obtain a plurality of second sentences in one-to-one correspondence with the third words.

503: The perplexity of each second sentence in the plurality of second sentences is calculated, to obtain a plurality of perplexities in one-to-one correspondence with the plurality of second sentences.

In the present embodiment, perplexity is used to describe how good a sentence is. In general, the better a sentence is, the less likely a reader is to be confused when trying to understand it. In other words, the lower the perplexity of a sentence, the more fluent its grammar and the smoother its semantics, and the better the sentence.

504: The third word corresponding to the second sentence with the lowest perplexity is taken as the fourth word.

In this way, the optimal second character can be obtained, and the recognized text after replacement has good grammatical and semantic fluency.
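Steps 501 to 504 can be sketched as follows. The embodiment does not specify which language model is used to compute perplexity; the sketch below substitutes a toy bigram model with crude add-one smoothing purely for illustration. The corpus, the token-list representation of sentences, and all function names are hypothetical.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count unigrams and bigrams over tokenised sentences (toy model)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + list(sentence)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams, len(unigrams)


def perplexity(sentence, model):
    """exp of the average negative log-probability per token."""
    unigrams, bigrams, vocab_size = model
    tokens = ["<s>"] + list(sentence)
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        # Crude add-one smoothed estimate of P(cur | prev).
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(sentence))


def pick_fourth_word(first_sentence, span, third_words, model):
    """Steps 502-504: substitute each third word, keep the lowest perplexity."""
    start, end = span  # token span occupied by the first word

    def second_sentence(word):
        return first_sentence[:start] + list(word) + first_sentence[end:]

    return min(third_words, key=lambda w: perplexity(second_sentence(w), model))


MODEL = train_bigram([
    ["city", "base", "foundation"],
    ["base", "foundation", "work"],
])
FIRST_SENTENCE = ["city", "crystal", "foundation"]
THIRD_WORDS = [("base", "foundation"), ("plug", "foundation")]
fourth_word = pick_fourth_word(FIRST_SENTENCE, (1, 3), THIRD_WORDS, MODEL)
```

In this toy setting the sentence containing "base foundation" matches bigrams seen in the corpus, so it has the lower perplexity and is selected.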

202: An identifier is added to each second character obtained by replacing a first character in the recognized text.

In this embodiment, the above operation replaces all first characters in the recognized text with the second character, so it may change a first character that was originally correct into the second character and thereby introduce new errors. Illustratively, continuing the example "the sewer pipe network is a crystal infrastructure of a city", all characters "crystal" in the recognized text are replaced with the character "base" in step 201. If the word "crystal" genuinely occurs in the text, it also becomes "base" after this modification, which introduces a new error.

Therefore, in the present embodiment, an identifier also needs to be added to each second character obtained by replacing a first character in the recognized text. For example, the identifier may be a highlight, so that a second character obtained by replacement can be distinguished from a second character originally present in the recognized text.
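One possible way to realise the identifier in software is to record the positions of replaced characters, as sketched below. The embodiment mentions highlighting; recording indices together with the original first character is an assumption of this sketch, as is the name `replace_and_mark`.

```python
def replace_and_mark(tokens, replacements):
    """Apply replacements and record which positions were changed.

    replacements: mapping position -> second character.
    Returns (new_tokens, marks), where marks maps each replaced position
    to the first character it originally held, so it can be rolled back.
    """
    new_tokens = list(tokens)
    marks = {}
    for index, second in replacements.items():
        marks[index] = new_tokens[index]
        new_tokens[index] = second
    return new_tokens, marks


original = ["base", "of", "crystal", "foundation"]
corrected, marks = replace_and_mark(original, {2: "base"})
```

Here the "base" at position 0 was present in the original text and carries no mark, while the "base" at position 2 is marked as a replacement.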

203: A second character to be corrected is determined among the identified second characters in the recognized text according to the adjacent characters of the identified second characters.

By way of example, the present application provides a method for determining a second character to be corrected among the identified second characters in the recognized text according to the adjacent characters of the identified second characters. As shown in fig. 6, the method comprises:

601: A fourth character is obtained.

Illustratively, the fourth character is a character adjacent to an identified second character in the recognized text; that is, the context of an identified second character can be used to determine which identified second characters need to be corrected. Specifically, take the sentence "glittering and translucent dewdrops are condensed on petals" in the text after the first round of error correction, in which the character "base" is a second character introduced by replacement during the first round and is highlighted. The previous character "on" (the character adjacent to the left of that second character) and the next character "off" (the character adjacent to the right of that second character) are acquired as the fourth characters.

602: The fourth character and the identified second character are combined in the order in which they appear in the recognized text to obtain a fifth word.

Illustratively, continuing the example "glittering and translucent dewdrops are condensed on petals", the identified second character is "base", and the fourth characters are the previous character "glittering" and the next character "glittering" of "base". Therefore, combining the fourth characters with the identified second character in their order in the recognized text yields the fifth words "face" and "sparkle".

603: Word embedding processing is performed on the fifth word to obtain a second word vector.

In this embodiment, the method for obtaining the second word vector is similar to the method for obtaining the first word vector in step 304, and is not repeated here.

604: and respectively calculating second similarity between the second word vector and the word vector of each template word in the plurality of template words in the preset word bank to obtain a plurality of second similarities which are in one-to-one correspondence with the plurality of template words.

605: and taking the largest second similarity in the plurality of second similarities as a second score of the second word vector.

In this embodiment, the method for obtaining the second similarity is similar to the method for obtaining the first similarity in step 401, and is not repeated herein.

606: when the second score is smaller than a fourth threshold, determining that the second character in the fifth word corresponding to that second word vector is a second character to be corrected.

In this embodiment, a second score smaller than the fourth threshold indicates that the fifth word corresponding to the second word vector has no actual semantics; informally, the fifth word is not a word used in conventional writing. The second character in that fifth word can therefore be determined as a second character to be corrected.
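Steps 601 to 606 can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: the toy vectors in `TOY_VECTORS`, the use of cosine similarity, the zero vector for unknown words, and all function names are hypothetical stand-ins for a trained word-embedding model and a real preset word bank.

```python
import math

# Hypothetical toy embeddings; a real system would use a trained embedding model.
TOY_VECTORS = {"AB": [1.0, 0.0], "BC": [0.0, 1.0]}


def embed(word):
    # Assumption: words unknown to the toy model map to a zero vector.
    return TOY_VECTORS.get(word, [0.0, 0.0])


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fifth_words(text, idx):
    """Steps 601-602: pair the marked character at idx with each neighbour, in text order."""
    words = []
    if idx > 0:
        words.append(text[idx - 1] + text[idx])
    if idx + 1 < len(text):
        words.append(text[idx] + text[idx + 1])
    return words


def second_score(word, template_words):
    """Steps 603-605: the largest similarity between the word and any template word."""
    vec = embed(word)
    return max(cosine(vec, embed(t)) for t in template_words)


def low_scoring_fifth_words(text, idx, template_words, fourth_threshold):
    """Step 606: fifth words whose second score falls below the fourth threshold."""
    return [w for w in fifth_words(text, idx)
            if second_score(w, template_words) < fourth_threshold]


templates = ["AB", "BC"]
print(low_scoring_fifth_words("ABC", 1, templates, 0.5))  # []
print(low_scoring_fifth_words("AXC", 1, templates, 0.5))  # ['AX', 'XC']
```

How to combine the results for the left and right fifth words (flagging the character when either or when both score low) is not fixed by this sketch; the function simply returns the low-scoring words so that the caller can decide.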

204: acquiring the feature of the second character to be corrected.

In this embodiment, after the second character to be corrected is determined, the feature of the second character to be corrected may be determined by determining the adjacent character of the second character to be corrected. Illustratively, the present application provides a method of obtaining a feature of a second character to be error corrected, as shown in fig. 7, the method comprising:

701: the fifth character is obtained.

Illustratively, the fifth character is a character adjacent to the second character to be corrected in the recognized text, so that the feature of the second character to be corrected can be determined from its context. Specifically, for the sentence "the sewer pipe network is a crystal infrastructure of a city" in the recognition result, the above method determines that the second character to be corrected is "crystal". The previous character (the character left-adjacent to the second character to be corrected) and the next character "foundation" (the character right-adjacent to the second character to be corrected) of the character "crystal" are each acquired as a fifth character.

702: acquiring a word vector of the fifth character and a word vector of the second character to be corrected.

703: splicing the word vector of the fifth character and the word vector of the second character to be corrected in the order in which the fifth character and the second character to be corrected appear in the recognized text, to obtain a fusion vector.

Illustratively, following the above example "the sewer pipe network is a crystal infrastructure of a city", suppose the word vector of the second character to be corrected, "crystal", is [1, 2], the word vector of the preceding fifth character is [3, 4], and the word vector of the following fifth character, "infrastructure", is [5, 6]. The word vectors are spliced horizontally in the order in which the fifth characters and the second character to be corrected appear in the original text, giving the fusion vector [3, 4, 1, 2, 5, 6].

704: taking the fusion vector as the feature of the second character to be corrected.
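The splicing in steps 701 to 704 can be sketched as below, reusing the vectors [3, 4], [1, 2] and [5, 6] from the example. The placeholder characters "L", "X" and "R", the dictionary `char_vectors`, and the function name are hypothetical; skipping a missing neighbour at the start or end of the text is an assumption of this sketch, not something the embodiment specifies.

```python
def fusion_vector(text, idx, char_vectors):
    """Concatenate the vectors of the left neighbour, the character at idx,
    and the right neighbour, in the order they appear in the text."""
    parts = []
    for pos in (idx - 1, idx, idx + 1):
        if 0 <= pos < len(text):  # assumption: a missing neighbour is skipped
            parts.extend(char_vectors[text[pos]])
    return parts


# "X" stands for the second character to be corrected, "L" and "R" for its neighbours.
char_vectors = {"L": [3, 4], "X": [1, 2], "R": [5, 6]}
print(fusion_vector("LXR", 1, char_vectors))  # [3, 4, 1, 2, 5, 6]
```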

205: replacing, with the first character, each identified second character in the recognized text that matches the feature of the second character to be corrected, to obtain the corrected recognized text.

In this embodiment, after the feature of the second character to be corrected is determined, the feature of each identified second character is acquired, and the similarity between the feature of the second character to be corrected and the feature of each identified second character is calculated. An identified second character whose similarity is greater than a fifth threshold is determined to be a second character matching the feature of the second character to be corrected.

The method for obtaining the feature of each identified second character is similar to the method for obtaining the feature of the second character to be corrected in step 204. The method for calculating the similarity between the feature of the second character to be corrected and the feature of each identified second character is similar to the method in step 401 for calculating the first similarity between each first word vector and the word vector of each template word in the preset word bank. Neither is repeated herein.
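The matching and rollback of step 205 can be sketched as follows. This is an illustrative sketch only: cosine similarity is assumed as the similarity measure, and the data layout (`marks` mapping each identified position to its original first character, `features` mapping each identified position to its fusion vector) and the function names are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def roll_back(text, marks, features, target_idx, fifth_threshold):
    """Restore the first character at every identified position whose feature
    is more similar than fifth_threshold to the feature at target_idx."""
    target = features[target_idx]
    chars = list(text)
    for idx, first_char in marks.items():
        if cosine(target, features[idx]) > fifth_threshold:
            chars[idx] = first_char
    return "".join(chars)


text = "aXbXcX"                   # "X" marks substituted second characters
marks = {1: "Y", 3: "Y", 5: "Y"}  # position -> original first character
features = {1: [1.0, 0.0], 3: [0.9, 0.1], 5: [0.0, 1.0]}
print(roll_back(text, marks, features, target_idx=1, fifth_threshold=0.8))  # aYbYcX
```

The second character to be corrected matches its own feature, so it is rolled back together with the other identified characters that appear in a similar context.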

In summary, the text error correction method provided by the present invention automatically identifies the first character to be modified and automatically determines the second character to replace it, so that the first character in the recognized text is changed into the second character and the recognized text is corrected automatically. An identifier is then added to each second character obtained by replacing a first character in the recognized text, the places where the replacement was itself erroneous are determined, and those replacements are rolled back automatically. Most recognition errors in the recognized text can thus be corrected with at most two rounds of modification, which greatly reduces the consumption of human resources, improves error correction efficiency, and ensures the accuracy of automatic error correction.

Referring to fig. 8, fig. 8 is a block diagram illustrating functional modules of a text error correction apparatus according to an embodiment of the present disclosure. As shown in fig. 8, the text correction apparatus 800 includes:

a character replacement module 801, configured to replace a first character in the recognized text with a second character;

a character identification module 802, configured to add an identifier to a second character obtained by replacing a first character in the recognized text;

a character determining module 803, configured to determine, according to adjacent characters of the identified second character in the recognition text, a second character to be corrected in the identified second character in the recognition text;

a feature determining module 804, configured to obtain a feature of a second character to be error corrected;

the character replacement module 801 is further configured to replace, with the first character, each identified second character in the recognition text that matches the feature of the second character to be corrected, so as to obtain the corrected recognition text.

In an embodiment of the present invention, before replacing the first character in the recognized text with the second character, the character determining module 803 is further configured to:

determining a character group corresponding to the first character according to an error table, wherein the error table records characters with the recognition error rate larger than a first threshold value and the character group corresponding to the characters in the OCR recognition process, the character group comprises a plurality of candidate characters corresponding to the characters, and the probability of the candidate characters being recognized as the characters in the OCR recognition process is larger than a second threshold value;

a second character is determined among the plurality of candidate characters based on the adjacent characters to the first character.
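One possible way to build such an error table from labelled OCR output is sketched below. The source does not say how the error rate or the recognition probability is estimated; this sketch assumes the error rate of a character is the fraction of its occurrences in the OCR output that are wrong, and that the probability for a candidate is the fraction of that candidate's true occurrences that were output as the character. The `samples` format and the function name are hypothetical.

```python
from collections import Counter


def build_error_table(samples, first_threshold, second_threshold):
    """samples: iterable of (true_char, recognized_char) pairs (hypothetical labelled data)."""
    recognized_total = Counter()  # how often each character appears in OCR output
    recognized_wrong = Counter()  # how often that output was wrong
    true_total = Counter()        # how often each character truly occurs
    pair_count = Counter()        # (true_char, recognized_char) confusions
    for true_char, rec_char in samples:
        recognized_total[rec_char] += 1
        true_total[true_char] += 1
        if true_char != rec_char:
            recognized_wrong[rec_char] += 1
            pair_count[(true_char, rec_char)] += 1

    table = {}
    for rec_char, total in recognized_total.items():
        if recognized_wrong[rec_char] / total > first_threshold:
            group = sorted(
                t for (t, r), n in pair_count.items()
                if r == rec_char and n / true_total[t] > second_threshold
            )
            if group:
                table[rec_char] = group
    return table


samples = [("O", "0")] * 3 + [("0", "0"), ("O", "O")] + [("A", "A")] * 5
print(build_error_table(samples, first_threshold=0.5, second_threshold=0.3))  # {'0': ['O']}
```

Looking up a first character in the resulting dictionary then yields its character group, that is, its candidate characters.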

In an embodiment of the present invention, in determining a second character among a plurality of candidate characters according to adjacent characters of a first character, the character determining module 803 is specifically configured to:

acquiring a third character, wherein the third character is a character adjacent to the first character in the recognized text;

combining the third character and the first character according to the sequence in the recognition text to obtain a first word;

replacing the first character in the first word with each candidate character in a plurality of candidate characters respectively to obtain a plurality of second words, wherein the plurality of second words are in one-to-one correspondence with the plurality of candidate characters;

performing word embedding processing on each second word in the plurality of second words respectively to obtain a plurality of first word vectors, wherein the plurality of first word vectors correspond to the plurality of second words one to one;

and for each first word vector in the plurality of first word vectors, matching each first word vector with word vectors of a plurality of template words in a preset word bank to obtain a plurality of first matching results, and determining a second character according to the plurality of first matching results, wherein the plurality of first matching results correspond to the plurality of template words one to one.
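The construction of the first word and the second words in the operations above can be sketched as follows. The function names and the example text are hypothetical, and returning one first word per neighbour (left and right) is an assumption of this sketch.

```python
def first_words(text, idx):
    """Combine the first character at idx with each adjacent (third) character,
    in text order. Returns (word, position of the first character in the word)."""
    words = []
    if idx > 0:
        words.append((text[idx - 1] + text[idx], 1))
    if idx + 1 < len(text):
        words.append((text[idx] + text[idx + 1], 0))
    return words


def second_words(first_word, pos, candidates):
    """Replace the first character (at pos) with each candidate character,
    giving one second word per candidate."""
    return {c: first_word[:pos] + c + first_word[pos + 1:] for c in candidates}


print(first_words("c0ld", 1))               # [('c0', 1), ('0l', 0)]
print(second_words("c0", 1, ["o", "O"]))    # {'o': 'co', 'O': 'cO'}
```

Each second word would then be passed to the word embedding step to obtain its first word vector.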

In an embodiment of the present invention, in terms of matching each first word vector in the plurality of first word vectors with word vectors of a plurality of template words in a preset word bank to obtain a plurality of first matching results, and determining the second character according to the plurality of first matching results, the character determining module 803 is specifically configured to:

respectively calculating a first similarity between each first word vector and a word vector of each template word in a plurality of template words in a preset word bank to obtain a plurality of first similarities, wherein the plurality of first similarities correspond to the plurality of template words one to one;

taking the maximum first similarity in the plurality of first similarities as a first score of each first word vector to obtain a plurality of first scores, wherein the plurality of first scores correspond to the plurality of first word vectors one to one;

taking a second word corresponding to the first word vector corresponding to the first score larger than a third threshold value in the plurality of first scores as a third word to obtain at least one third word;

and determining a fourth word in the at least one third word, and taking the character which is at the same position as the first character in the fourth word as a second character, wherein the fourth word is the word which is matched with the recognition text to the highest degree in the at least one third word.
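The scoring and thresholding that produce the third words can be sketched as follows. Cosine similarity, the toy two-dimensional vectors, and the function names are assumptions made for illustration; the embodiment only requires some similarity between word vectors.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def third_words(second_word_vectors, template_vectors, third_threshold):
    """Keep each second word whose first score (its largest similarity to any
    template word vector) is greater than the third threshold."""
    kept = []
    for word, vec in second_word_vectors.items():
        first_score = max(cosine(vec, t) for t in template_vectors)
        if first_score > third_threshold:
            kept.append(word)
    return kept


second_word_vectors = {"co": [1.0, 0.1], "cq": [1.0, 1.0]}  # hypothetical vectors
template_vectors = [[1.0, 0.0], [0.0, 1.0]]
print(third_words(second_word_vectors, template_vectors, 0.9))  # ['co']
```

When more than one third word survives, the fourth word is chosen among them by how well each fits the surrounding sentence.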

In an embodiment of the present invention, in determining the fourth word in the at least one third word, the character determining module 803 is specifically configured to:

obtaining, from the recognized text, the sentence in which the first word is located as a first sentence;

for each third word in the at least one third word, replacing the first word in the first sentence with each third word to obtain a plurality of second sentences, wherein the plurality of second sentences correspond to the plurality of third words one by one;

respectively calculating the perplexity of each second sentence in the plurality of second sentences to obtain a plurality of perplexities, wherein the plurality of perplexities correspond to the plurality of second sentences one to one;

and taking the third word corresponding to the second sentence with the smallest perplexity among the plurality of perplexities as the fourth word.
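The selection of the fourth word by the confusion degree (perplexity) of each second sentence can be sketched as follows. The source does not name a language model, so this sketch assumes a toy character-level bigram model with add-one smoothing; the corpus, the function names, and the example sentence are all hypothetical.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count character bigrams over a small corpus (toy stand-in for a real language model)."""
    contexts, bigrams = Counter(), Counter()
    vocab = {"<s>"}
    for sentence in corpus:
        tokens = ["<s>"] + list(sentence)
        vocab.update(tokens)
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return contexts, bigrams, len(vocab)


def perplexity(sentence, model):
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + list(sentence)
    log_prob = 0.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        # Add-one smoothing so unseen bigrams keep a non-zero probability.
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(tokens) - 1))


def pick_fourth_word(first_sentence, first_word, third_words, model):
    """Substitute each third word for the first word and keep the one whose
    second sentence has the smallest perplexity."""
    scored = {w: perplexity(first_sentence.replace(first_word, w, 1), model)
              for w in third_words}
    return min(scored, key=scored.get)


model = train_bigram(["the cat sat", "the cat ran"])
print(pick_fourth_word("the c0t sat", "c0", ["ca", "co"], model))  # ca
```

A lower perplexity means the language model finds the sentence more plausible, which is why the smallest value is selected.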

In an embodiment of the present invention, in terms of determining, according to the characters adjacent to the identified second characters in the recognized text, the second character to be corrected among the identified second characters, the character determining module 803 is specifically configured to:

acquiring a fourth character, wherein the fourth character is a character adjacent to an identified second character in the recognized text;

combining the fourth character and the second character with the identification in the recognition text according to the sequence in the recognition text to obtain a fifth word;

performing word embedding processing on the fifth word to obtain a second word vector;

respectively calculating a second similarity between the second word vector and the word vector of each template word in a plurality of template words in a preset word bank to obtain a plurality of second similarities, wherein the plurality of second similarities correspond to the plurality of template words one to one;

taking the largest second similarity in the plurality of second similarities as a second score of the second word vector;

and when the second score is smaller than a fourth threshold value, determining that a second character in a fifth word corresponding to the second word vector corresponding to the second score is a second character to be corrected.

In an embodiment of the present invention, in terms of obtaining a feature of the second character to be corrected, the feature determining module 804 is specifically configured to:

acquiring a fifth character, wherein the fifth character is an adjacent character of the second character to be corrected;

acquiring a word vector of a fifth character, and acquiring a word vector of a second character to be corrected;

splicing the word vector of the fifth character and the word vector of the second character to be corrected according to the sequence of the fifth character and the second character to be corrected in the recognition text to obtain a fusion vector;

and taking the fusion vector as the characteristic of the second character to be corrected.

Referring to fig. 9, fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 9, the electronic device 900 includes a transceiver 901, a processor 902, and a memory 903, which are connected to each other by a bus 904. The memory 903 is used to store computer programs and data, and may transfer the stored data to the processor 902.

The processor 902 is configured to read the computer program in the memory 903 to perform the following operations:

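replacing a first character in the recognized text with a second character;

adding an identifier to the second character obtained by replacing the first character in the recognized text;

determining, according to adjacent characters of the identified second characters in the recognized text, a second character to be corrected among the identified second characters;

acquiring the feature of the second character to be corrected;

and replacing, with the first character, each identified second character in the recognized text that matches the feature of the second character to be corrected, to obtain the corrected recognized text.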

In an embodiment of the present invention, before replacing the first character in the recognized text with the second character, the processor 902 is further configured to:

In an embodiment of the present invention, in determining a second character among a plurality of candidate characters according to a neighboring character of a first character, the processor 902 is specifically configured to:

In an embodiment of the present invention, in terms of matching, for each first word vector in the plurality of first word vectors, each word vector with word vectors of a plurality of template words in a preset lexicon to obtain a plurality of matching results, and determining a second character according to the plurality of matching results, the processor 902 is specifically configured to perform the following operations:

In an embodiment of the present invention, in determining the fourth term among the at least one third term, the processor 902 is specifically configured to perform the following operations:

In an embodiment of the present invention, in determining a second character to be corrected in the recognized text according to adjacent characters of the recognized text with the identified second character, the processor 902 is specifically configured to:

acquiring a fourth character, wherein the fourth character is an adjacent character of a second character with an identification in the identification text; combining the fourth character and the second character with the identification in the recognition text according to the sequence in the recognition text to obtain a fifth word;

In an embodiment of the present invention, in terms of obtaining the feature of the second character to be error corrected, the processor 902 is specifically configured to perform the following operations:

It should be understood that the text error correction device in the present application may include a smart phone (e.g., an Android phone, an iOS phone, a Windows Phone, etc.), a tablet computer, a palm computer, a notebook computer, a Mobile Internet Device (MID), a robot, a wearable device, and the like. The devices listed above are merely exemplary and not exhaustive; the text error correction device includes but is not limited to them. In practical applications, the text error correction device may further include an intelligent vehicle-mounted terminal, computer equipment, and the like.

Through the above description of the embodiments, those skilled in the art will clearly understand that the present invention can be implemented by combining software with a hardware platform. With this understanding in mind, all or part of the technical solutions of the present invention that contribute over the background art can be embodied in the form of a software product, which can be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes instructions for causing a computer device (which can be a personal computer, a server, a network device, etc.) to execute the methods according to the embodiments or some parts of the embodiments.

Accordingly, the present application also provides a computer readable storage medium, which stores a computer program, wherein the computer program is executed by a processor to implement part or all of the steps of any one of the text error correction methods as described in the above method embodiments. For example, the storage medium may include a hard disk, a floppy disk, an optical disk, a magnetic tape, a magnetic disk, a flash memory, and the like.

Embodiments of the present application also provide a computer program product comprising a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps of any one of the text error correction methods as described in the above method embodiments.

It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are all alternative embodiments and that the acts and modules referred to are not necessarily required by the application.

In the above embodiments, the description of each embodiment has its own emphasis, and for parts not described in detail in a certain embodiment, reference may be made to the description of other embodiments.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative: the division of the units is merely a logical division, and other divisions are possible in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical or in another form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software program module.

The integrated units, if implemented in the form of software program modules and sold or used as stand-alone products, may be stored in a computer readable memory. Based on such understanding, the technical solution of the present application in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product that is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method described in the embodiments of the present application. The aforementioned memory includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and various other media capable of storing program code.

Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable memory, and the memory may include: a flash memory disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.

The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the methods and their core ideas of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims

1. A method for correcting text, the method comprising:

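replacing a first character in the recognized text with a second character;

adding an identifier to the second character obtained by replacing the first character in the recognition text;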

determining a second character to be corrected in the second character with the identification in the recognition text according to the adjacent character of the second character with the identification in the recognition text;

acquiring the characteristics of the second character to be corrected;

and replacing second characters which are matched with the characteristics of the second characters to be corrected in the second characters with the identifiers in the recognition text with the first characters to obtain the corrected recognition text.

2. The error correction method of claim 1, wherein before the replacing the first character in the recognized text with the second character, the error correction method further comprises:

determining a character group corresponding to the first character according to an error table, wherein the error table records characters with recognition error rates larger than a first threshold value and a character group corresponding to the characters in an OCR (optical character recognition) process, the character group comprises a plurality of candidate characters corresponding to the characters, and the probability that the candidate characters are recognized as the characters in the OCR process is larger than a second threshold value;

determining the second character among the plurality of candidate characters according to adjacent characters of the first character.

3. The error correction method of claim 2, wherein said determining the second character among the plurality of candidate characters based on the characters adjacent to the first character comprises:

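acquiring a third character, wherein the third character is a character adjacent to the first character in the recognition text;

combining the third character and the first character according to the sequence in the recognition text to obtain a first word;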

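replacing a first character in the first word with each candidate character in the candidate characters respectively to obtain a plurality of second words, wherein the second words are in one-to-one correspondence with the candidate characters;

performing word embedding processing on each second word in the plurality of second words respectively to obtain a plurality of first word vectors, wherein the plurality of first word vectors correspond to the plurality of second words one to one;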

and matching each first word vector in the plurality of first word vectors with word vectors of a plurality of template words in a preset word bank to obtain a plurality of first matching results, and determining the second character according to the plurality of first matching results, wherein the plurality of first matching results are in one-to-one correspondence with the plurality of template words.

4. The error correction method according to claim 3, wherein the matching, for each of the plurality of first word vectors, the first word vector with word vectors of a plurality of template words in a preset lexicon to obtain a plurality of matching results, and determining the second character according to the plurality of matching results comprises:

respectively calculating a first similarity between each first word vector and a word vector of each template word in a plurality of template words in the preset word bank to obtain a plurality of first similarities, wherein the plurality of first similarities correspond to the plurality of template words one to one;

taking the largest first similarity in the plurality of first similarities as a first score of each first word vector to obtain a plurality of first scores, wherein the plurality of first scores correspond to the plurality of first word vectors one to one;

taking a second word corresponding to the first word vector corresponding to the first score which is greater than a third threshold value in the plurality of first scores as a third word to obtain at least one third word;

and determining a fourth word in the at least one third word, and taking a character in the same position as the first character in the fourth word as the second character, wherein the fourth word is the word with the highest matching degree with the recognition text in the at least one third word.

5. The error correction method of claim 4, wherein said determining a fourth word among the at least one third word comprises:

in the recognition text, obtaining a sentence where the first word is located to obtain a first sentence;

for each third word in the at least one third word, replacing a first word in the first sentence with the each third word respectively to obtain a plurality of second sentences, wherein the plurality of second sentences correspond to the plurality of third words one by one;

respectively calculating the perplexity of each second sentence in the plurality of second sentences to obtain a plurality of perplexities, wherein the plurality of perplexities are in one-to-one correspondence with the plurality of second sentences;

and taking the third word corresponding to the second sentence with the smallest perplexity among the plurality of perplexities as the fourth word.

6. The error correction method according to any one of claims 1 to 5, wherein the determining the second character to be corrected in the identified second characters in the identified text according to the adjacent characters of the identified second character in the identified text comprises:

acquiring a fourth character, wherein the fourth character is an adjacent character of a second character with an identifier in the recognition text;

combining the fourth character with the second character with the identification in the recognition text according to the sequence in the recognition text to obtain a fifth word;

performing word embedding processing on the fifth word to obtain a second word vector;

respectively calculating a second similarity between the second word vector and a word vector of each template word in a plurality of template words in the preset word bank to obtain a plurality of second similarities, wherein the plurality of second similarities correspond to the plurality of template words one to one;

taking the largest second similarity in the plurality of second similarities as a second score of the second word vector;

and when the second score is smaller than a fourth threshold value, determining that a second character in a fifth word corresponding to a second word vector corresponding to the second score is the second character to be corrected.

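7. The error correction method according to any one of claims 1 to 5, wherein the obtaining the feature of the second character to be error-corrected comprises:

acquiring a fifth character, wherein the fifth character is an adjacent character of the second character to be corrected;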

acquiring a word vector of the fifth character, and acquiring a word vector of the second character to be corrected;

splicing the word vector of the fifth character and the word vector of the second character to be corrected according to the sequence of the fifth character and the second character to be corrected in the recognition text to obtain a fusion vector;

and taking the fusion vector as the feature of the second character to be corrected.

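8. A text correction apparatus, characterized in that the correction apparatus comprises:

the character replacement module is used for replacing a first character in the recognition text with a second character;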

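the character identification module is used for adding identification to a second character obtained by replacing the first character in the recognition text;

the character determining module is used for determining a second character to be corrected in the second characters with the identifiers in the recognition text according to adjacent characters of the second characters with the identifiers in the recognition text;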

the characteristic determining module is used for acquiring the characteristic of the second character to be corrected;

the character replacement module is further configured to replace a second character, which is matched with the feature of the second character to be corrected, in the second character with the identifier in the recognition text with the first character, so as to obtain the corrected recognition text.

9. An electronic device comprising a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, the one or more programs including instructions for performing the steps in the method of any of claims 1-7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which is executed by a processor to implement the method according to any one of claims 1-7.