CN108664471B - Character recognition error correction method, device, equipment and computer readable storage medium - Google Patents

Info

Publication number
CN108664471B
Authority
CN
China
Prior art keywords
file
error correction
target
phrase
corrected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810430989.8A
Other languages
Chinese (zh)
Other versions
CN108664471A (en
Inventor
张远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yiyin Technology Co ltd
Shenzhen Lian Intellectual Property Service Center
Original Assignee
Beijing Yiyin Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yiyin Technology Co ltd filed Critical Beijing Yiyin Technology Co ltd
Priority to CN201810430989.8A priority Critical patent/CN108664471B/en
Publication of CN108664471A publication Critical patent/CN108664471A/en
Application granted granted Critical
Publication of CN108664471B publication Critical patent/CN108664471B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/232Orthographic correction, e.g. spell checking or vowelisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a character recognition error correction method, a device, equipment and a computer readable storage medium, wherein the method comprises the following steps: when a file to be corrected is received, reading the extension name of the file to be corrected, and determining the attribute of the file to be corrected according to the extension name; judging whether the attribute of the file to be corrected is a read-only file or not, if the attribute of the file to be corrected is the read-only file, performing attribute conversion on the file to be corrected to generate an editable file; reading a plurality of keywords in the editable file to form a keyword group, and determining the target file type of the editable file according to the keyword group; and determining a target error correction library corresponding to the type of the target file according to the preset mapping relation between the type of the editable file and the error correction library, and calling the target error correction library to correct the editable file. According to the scheme, different error correction libraries are set according to different file types, and the target error correction library corresponding to the file type is used for correcting errors, so that error correction is more accurate, and error correction efficiency is improved.

Description

Character recognition error correction method, device, equipment and computer readable storage medium
Technical Field
The invention mainly relates to the technical field of intelligent recognition, in particular to a character recognition error correction method, a character recognition error correction device, character recognition error correction equipment and a computer readable storage medium.
Background
At present, many scenarios require recognizing the characters in non-editable files (such as PDFs and pictures) and converting them into editable files. Similar-looking characters can be difficult to distinguish during recognition, so the converted files contain wrongly written characters, and there is currently no mechanism for recognizing and correcting them after conversion. Likewise, there is no recognition and error correction mechanism for wrongly written characters in manually edited files; they can only be checked by hand, which wastes time and labor.
Disclosure of Invention
The invention mainly aims to provide a character recognition and error correction method, a device, equipment and a computer readable storage medium, which aim to solve the problem that in the prior art, a recognition and error correction mechanism is not available for wrongly written characters in a file.
In order to achieve the above object, the present invention provides a text recognition error correction method, which includes the following steps:
when a file to be corrected is received, reading an extension of the file to be corrected, and determining the attribute of the file to be corrected according to the extension;
Judging whether the attribute of the file to be corrected is a read-only file or not, if the attribute of the file to be corrected is the read-only file, performing attribute conversion on the file to be corrected to generate an editable file;
reading a plurality of keywords in the editable file to form a keyword group, and determining the target file type of the editable file according to the keyword group;
and determining a target error correction library corresponding to the target file type according to a preset mapping relation between the file type of the editable file and the error correction library, and calling the target error correction library to correct the editable file.
Preferably, the step of calling the target error correction library to correct the editable file includes:
identifying at least one sentence in the editable file, detecting a connecting word in each identified sentence, and dividing each sentence into a plurality of phrases to be identified according to the connecting word;
comparing the phrases to be identified with each preset phrase in the target error correction library one by one, and judging whether preset phrases consistent with the phrases to be identified exist in the target error correction library or not;
if the preset phrase consistent with the phrase to be identified does not exist in the target error correction library, acquiring a target preset phrase with highest similarity with the phrase to be identified in the target error correction library, and replacing the phrase to be identified with the target preset phrase.
Preferably, the step of replacing the phrase to be identified with the target preset phrase includes:
acquiring the phrases to be recognized that are adjacent to the current phrase to be recognized, forming a sentence to be recognized from the adjacent phrases and the target preset phrase, and judging, according to the sentence to be recognized, whether the target preset phrase matches the semantic scene of the editable file;
and if the target preset phrase is matched with the editable file, replacing the phrase to be identified with the target preset phrase.
Preferably, the step of determining the target file type of the editable file according to the keyword group includes:
comparing the keyword group with a preset keyword group library, and determining a target keyword group in the preset keyword group library, wherein the element matching rate of the target keyword group and the keyword group is highest;
and determining a target file type corresponding to the target keyword group according to the mapping relation between the keyword group and the file type in the preset keyword group library, and determining the corresponding target file type as the target file type of the editable file.
Preferably, the step of converting the attribute of the file to be corrected to generate an editable file includes:
Scanning the file to be corrected, and determining a title and a paragraph in the file to be corrected according to the size relation and the interval relation among the characters in the file to be corrected;
scanning the title and the characters in the paragraphs one by one, identifying the scanned characters according to a preset character library, and adding a title identifier to the identified title characters;
and transmitting the identified title characters and paragraph characters to a preset editor to generate the editable file.
Preferably, the step of reading a plurality of keywords in the editable file to form a keyword group includes:
reading the phrase in the editable file, counting the occurrence frequency of each phrase, and taking the phrase with the frequency larger than a preset value as the keyword;
and acquiring the phrases in the title according to the title identifier, and forming the keyword group from the phrases in the title and the keywords.
Preferably, after the step of calling the target error correction library to correct the editable file, the method further includes:
outputting the file after error correction, and, when a correction operation on the output file is received, transmitting the correction words corresponding to the correction operation to the target error correction library so as to update the target error correction library.
In addition, in order to achieve the above object, the present invention also provides a word recognition and error correction device, including:
the reading module is used for reading the extension name of the file to be corrected when the file to be corrected is received, and determining the attribute of the file to be corrected according to the extension name;
the judging module is used for judging whether the attribute of the file to be corrected is a read-only file or not, and if the attribute of the file to be corrected is the read-only file, performing attribute conversion on the file to be corrected to generate an editable file;
the determining module is used for reading a plurality of keywords in the editable file to form a keyword group and determining the target file type of the editable file according to the keyword group;
and the error correction module is used for determining a target error correction library corresponding to the target file type according to the preset mapping relation between the file type of the editable file and the error correction library, and calling the target error correction library to correct the editable file.
In addition, to achieve the above object, the present invention also proposes a text recognition and error correction apparatus, including: a memory, a processor, a communication bus, and a word recognition error correction program stored on the memory;
The communication bus is used for realizing connection communication between the processor and the memory;
the processor is used for executing the text recognition error correction program to realize the following steps:
when a file to be corrected is received, reading an extension of the file to be corrected, and determining the attribute of the file to be corrected according to the extension;
judging whether the attribute of the file to be corrected is a read-only file or not, if the attribute of the file to be corrected is the read-only file, performing attribute conversion on the file to be corrected to generate an editable file;
reading a plurality of keywords in the editable file to form a keyword group, and determining the target file type of the editable file according to the keyword group;
and determining a target error correction library corresponding to the target file type according to a preset mapping relation between the file type of the editable file and the error correction library, and calling the target error correction library to correct the editable file.
In addition, to achieve the above object, the present invention also provides a computer-readable storage medium storing one or more programs executable by one or more processors for:
When a file to be corrected is received, reading an extension of the file to be corrected, and determining the attribute of the file to be corrected according to the extension;
judging whether the attribute of the file to be corrected is a read-only file or not, if the attribute of the file to be corrected is the read-only file, performing attribute conversion on the file to be corrected to generate an editable file;
reading a plurality of keywords in the editable file to form a keyword group, and determining the target file type of the editable file according to the keyword group;
and determining a target error correction library corresponding to the target file type according to a preset mapping relation between the file type of the editable file and the error correction library, and calling the target error correction library to correct the editable file.
According to the character recognition error correction method of the invention, when a file to be corrected is received, the extension of the file is read and the attribute of the file is determined according to the extension; whether the attribute is read-only is judged, and if so, attribute conversion is performed on the file to generate an editable file; a plurality of keywords in the editable file are read to form a keyword group, and the target file type of the editable file is determined according to the keyword group; and the target error correction library corresponding to the target file type is determined according to the preset mapping relation between file types and error correction libraries, and is called to correct the editable file. The scheme can recognize and correct both read-only and non-read-only files: a read-only file is first converted into an editable file, its file type is determined from the keyword group in the editable file, and the target error correction library corresponding to that file type is called for error correction. Because files of different types use phrases specific to different industries, a separate error correction library is set for each file type and the library matching the file type is used for correction, which makes error correction more accurate, avoids manual error correction, and improves error correction efficiency.
Drawings
FIG. 1 is a flow chart of a first embodiment of the text recognition error correction method of the present invention;
FIG. 2 is a flow chart of a second embodiment of the text recognition error correction method of the present invention;
FIG. 3 is a schematic diagram of functional modules of a first embodiment of the word recognition and error correction apparatus of the present invention;
FIG. 4 is a schematic diagram of a device architecture of a hardware operating environment involved in a method according to an embodiment of the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The invention provides a character recognition and error correction method.
Referring to FIG. 1, FIG. 1 is a flowchart illustrating a text recognition error correction method according to a first embodiment of the present invention. In this embodiment, the text recognition error correction method includes:
Step S10, when a file to be corrected is received, reading an extension of the file to be corrected, and determining the attribute of the file to be corrected according to the extension;
The character recognition error correction method is applied to a system server and is suitable for recognizing and correcting wrongly written characters in electronic files. The electronic file may be a read-only file such as a PDF or a picture, or an editable file such as a Word or Excel document; the electronic file that needs error correction is taken as the file to be corrected. Because a read-only file cannot be modified and an editable file can, recognition and correction of wrongly written characters must be handled differently for the two. Files of different types have different extensions, so when the file to be corrected is received, its extension is read and its attribute is determined from the extension. The attribute indicates whether the file to be corrected is a read-only file or an editable file. To determine the attribute from the extension, a read-only extension library and an editable extension library are preset. The read-only extension library contains the extensions of the various read-only file types, for example {pdf, jpg, png, bmp}; the editable extension library contains the extensions of the various editable file types, for example {doc, txt, xls, ppt}. After the extension of the file to be corrected is read, it is determined whether the extension exists in the read-only extension library or in the editable extension library. If it exists in the read-only extension library, the attribute of the file is determined to be read-only; if it exists in the editable extension library, the attribute is determined to be editable.
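Step S10 can be illustrated with a minimal Python sketch. The two extension sets reuse the examples given in the description; the function name `file_attribute`, the returned strings, and the error raised for an unknown extension are assumptions of this sketch, since the description does not say how an extension found in neither library is handled.

```python
from pathlib import Path

# Example extension libraries taken from the description; a real
# deployment would presumably list more extensions.
READ_ONLY_EXTENSIONS = {"pdf", "jpg", "png", "bmp"}
EDITABLE_EXTENSIONS = {"doc", "txt", "xls", "ppt"}


def file_attribute(filename: str) -> str:
    """Return 'read-only' or 'editable' based on the file extension."""
    ext = Path(filename).suffix.lower().lstrip(".")
    if ext in READ_ONLY_EXTENSIONS:
        return "read-only"
    if ext in EDITABLE_EXTENSIONS:
        return "editable"
    # Assumption: the description does not cover unknown extensions,
    # so this sketch simply rejects them.
    raise ValueError(f"unknown extension: {ext!r}")
```

For instance, `file_attribute("contract.pdf")` yields `"read-only"`, which would send the file through the attribute conversion of step S20.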
Step S20, judging whether the attribute of the file to be corrected is a read-only file, if the attribute of the file to be corrected is a read-only file, performing attribute conversion on the file to be corrected to generate an editable file;
Further, because a read-only file is not editable, wrongly written characters recognized in it cannot be corrected in place, so the file must first be converted into an editable file. After the attribute of the file to be corrected is determined, it is judged whether the attribute is read-only; if so, attribute conversion is performed to convert the read-only file into an editable file. During conversion, the characters in the file to be corrected are recognized, the recognized characters are retrieved from a character library, and they are transmitted to a text editor to generate the editable file. If the attribute is judged not to be read-only, that is, the file to be corrected is already an editable file, no attribute conversion is needed and the wrongly written characters of the editable file are recognized and corrected directly.
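The branch in step S20 can be sketched as follows. The character recognition itself is not shown: `convert` is a hypothetical callable standing in for it, and the attribute strings are likewise assumptions of this sketch.

```python
def to_editable(path, attribute, convert):
    """Return the path of an editable version of the file.

    `convert` is a hypothetical callable standing in for the character
    recognition step; it takes the path of a read-only file and returns
    the path of the generated editable file.
    """
    if attribute == "read-only":
        return convert(path)
    # An editable file needs no attribute conversion.
    return path
```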
Step S30, reading a plurality of keywords in the editable file to form a keyword group, and determining the target file type of the editable file according to the keyword group;
As can be appreciated, documents in different industries have their own characteristic phrases, such as "prosecution", "defendant" and "plaintiff" in the legal field, and "bond", "financing", "deposit" and "loan" in the financial industry. To recognize wrongly written characters, the phrases and sentences of each type are collected into an error correction library, and recognition is performed against that library. If a single error correction library were used for files of all types in all industries, it would contain a very large number of phrases and sentences, which would introduce a great deal of noise and reduce recognition efficiency. To recognize files from different industry fields in a targeted way, this embodiment classifies file types by industry field, sets a corresponding error correction library for each file type, and uses the library of a given industry field to correct files of that type, thereby improving error correction efficiency. After the editable file is generated, the file type to which it belongs must be determined so that the corresponding error correction library can be used. Since file types are distinguished by industry field, a file belonging to a given field carries keywords related to that field, such as "prosecution", "defendant" and "plaintiff" in the legal field. Therefore, a plurality of keywords carried in the file can be read to form a keyword group, the industry field of the file is determined from the keyword group, and the target file type of the editable file is determined in turn.
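How the keyword group might be formed can be sketched in Python, following the preferred variant in the summary of the invention: phrases whose frequency exceeds a preset value become keywords, and the phrases found in titles are added to them. The function name and the `min_count` parameter (a stand-in for the preset value) are assumptions of this sketch, and the input is assumed to be already split into phrases.

```python
from collections import Counter


def build_keyword_group(phrases, title_phrases, min_count=2):
    """Combine frequent phrases with title phrases into one keyword group.

    `phrases` is the list of phrases read from the editable file,
    `title_phrases` the phrases found in titles, and `min_count` a
    hypothetical stand-in for the preset frequency value.
    """
    counts = Counter(phrases)
    # Keep only phrases that occur more often than the preset value.
    keywords = {p for p, n in counts.items() if n > min_count}
    return keywords | set(title_phrases)
```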
Step S40, determining a target error correction library corresponding to the target file type according to a preset mapping relation between the file type of the editable file and the error correction library, and calling the target error correction library to correct the editable file.
Furthermore, because each type of editable file has its own error correction library, a mapping relation is preset between file types and error correction libraries, with one file type corresponding to one library. The preset mapping relation can be a key-value pair, with the file type as the key and the error correction library as the value. After the target file type of the editable file is determined, the target error correction library is found by querying the mapping with the target file type as the key, and the library is called to correct the editable file. Specifically, the error correction step includes:
Step S41, identifying at least one sentence in the editable file, detecting a connecting word in each identified sentence, and dividing each sentence into a plurality of phrases to be identified according to the connecting word;
It is understood that the editable file contains a plurality of sentences, so at least one sentence in the editable file is first identified when the file is checked for errors. Various punctuation marks are set as identification identifiers, the identifiers in the editable file are detected, and the content between two identifiers is taken as one sentence. The set of identification identifiers includes punctuation marks such as the comma, full stop, semicolon, quotation marks and colon. When any one of these elements is detected, detection continues until the next element of the set is detected, and the content between the two elements is a sentence in the editable file. After at least one sentence has been identified, its content is divided further: connecting words are set as dividing markers, the connecting words in each identified sentence are detected, and the sentence is divided into a plurality of phrases to be identified at those words. The connecting words include, but are not limited to: "and", "with", "both", "together", "then", "additionally", "just like", "such as", "but", "although", "however", "only", "because", "by", "or", "also", "if" and "unless". When a connecting word is detected in an identified sentence, detection continues until the next connecting word in the sentence is detected; the text between the two connecting words is a phrase to be identified. If no further connecting word is found, that is, the sentence contains only one connecting word, the sentence is divided into two phrases to be identified, which are then recognized in the subsequent steps.
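The two-level division of step S41 can be sketched with regular expressions. This is an illustration on English text only: the delimiter set and the short connective list are assumptions of this sketch, and the source method targets Chinese text, where word boundaries work differently.

```python
import re

# Punctuation used as sentence delimiters; the exact set is an assumption.
SENTENCE_DELIMITERS = r'[,.;:"]'
# A few English connecting words; the description's list is longer.
CONNECTIVES = ("and", "but", "or", "because", "if")
_CONNECTIVE_RE = re.compile(
    r"\b(?:" + "|".join(CONNECTIVES) + r")\b", re.IGNORECASE
)


def split_sentences(text):
    """Split text into sentences at the delimiter punctuation."""
    return [s.strip() for s in re.split(SENTENCE_DELIMITERS, text) if s.strip()]


def split_phrases(sentence):
    """Split one sentence into phrases at the connecting words."""
    return [p.strip() for p in _CONNECTIVE_RE.split(sentence) if p.strip()]
```

Applied to "The loan is due; pay the deposit and the bond.", `split_sentences` gives two sentences, and `split_phrases` divides the second into "pay the deposit" and "the bond".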
Step S42, comparing the phrases to be identified with each preset phrase in the target error correction library one by one, and judging whether preset phrases consistent with the phrases to be identified exist in the target error correction library or not;
Further, the target error correction library contains a plurality of preset phrases for the industry field to which the editable file's type belongs. After the editable file has been divided into phrases to be identified, each phrase is compared with the preset phrases to judge whether a preset phrase consistent with it exists in the target error correction library. Because the preset phrases are the accurate vocabulary of the file's industry field, a phrase to be identified that has a consistent preset phrase is correct and needs no error correction.
Step S43, if the preset phrase consistent with the phrase to be identified does not exist in the target error correction library, acquiring a target preset phrase with highest similarity with the phrase to be identified in the target error correction library, and replacing the phrase to be identified with the target preset phrase.
If no preset phrase consistent with the phrase to be identified exists in the target error correction library, the phrase may be wrong and needs to be corrected. For correction, the target preset phrase with the highest similarity to the phrase to be identified is obtained from the target error correction library. Similarity covers two aspects, glyph shape similarity and semantic similarity: shape similarity identifies the written form the phrase most probably has, and semantic similarity identifies, from the context, the meaning the phrase most probably has. When a preset phrase is the most similar to the phrase to be identified in both shape and meaning, it is most probably the correct form of that phrase, so the phrase to be identified can be replaced with it. Before replacement, the semantic match of the target preset phrase must be confirmed, and the replacement is made only after it matches. The specific steps include:
Step S431, acquiring the phrases to be recognized that are adjacent to the current phrase to be recognized, forming a sentence to be recognized from the adjacent phrases and the target preset phrase, and judging, according to the sentence to be recognized, whether the target preset phrase matches the semantic scene of the editable file;
Understandably, files of the same type share a consistent semantic scene, and a sentence formed from the current phrase to be recognized and its adjacent phrases should be consistent with the semantic scene of the editable file. The phrases to be recognized immediately before and after the current phrase are therefore acquired. Because the phrases were divided at connecting words, the target preset phrase is placed between its adjacent phrases and the connecting words removed during division are added back to form a sentence to be recognized. Whether the target preset phrase matches the semantic scene of the editable file is then judged from the consistency of this sentence with that scene. Take, for example, a sentence stating that the legal rights and interests of a company are protected by law, in which one character of the phrase "legal protection" has been wrongly recognized. During division, the connecting words in the sentence are identified, and the phrases to be recognized are "company", "legal rights and interests" and the wrongly written form of "legal protection". No preset phrase is consistent with the current phrase to be recognized, the wrongly written "legal protection", so the target preset phrase "legal protection", which has the highest shape similarity, is obtained from the target error correction library. To check the semantic match, the adjacent phrase "legal rights and interests" is acquired, the target preset phrase "legal protection" is joined to it, and the connecting word removed during division is restored to form the sentence to be recognized, "legal rights and interests are protected by law". Whether the target preset phrase "legal protection" matches the semantic scene of the editable file is then judged from the consistency of this sentence with that scene.
Step S432, if the target preset phrase matches the editable file, replacing the phrase to be identified with the target preset phrase.
If the sentence to be identified is consistent with the semantic scene of the editable file, the target preset phrase matches the editable file, and it replaces the phrase to be identified, correcting the wrongly written phrase to the target preset phrase. Determining the target preset phrase from both shape and semantics ensures the accuracy of the replacement.
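Step S40 and its sub-steps S42 to S432 can be put together in one hedged Python sketch. Several parts are stand-ins rather than the patented method: the library contents are illustrative, `difflib.SequenceMatcher` substitutes plain string similarity for the glyph shape similarity of the description, and `fits_context` is a hypothetical callable representing the semantic scene check, which the description does not specify in detail.

```python
from difflib import SequenceMatcher

# Preset mapping: file type (key) -> error correction library (value).
# The library contents here are illustrative only.
CORRECTION_LIBRARIES = {
    "legal": {"company", "legal rights and interests", "legal protection"},
    "finance": {"bond", "financing", "deposit", "loan"},
}


def best_candidate(phrase, library):
    """Return the library phrase most similar to `phrase`.

    SequenceMatcher's ratio is a stand-in for shape similarity.
    """
    return max(
        library,
        key=lambda preset: SequenceMatcher(None, phrase, preset).ratio(),
    )


def correct_phrases(phrases, connectives, file_type, fits_context):
    """Correct the phrases of one sentence.

    `connectives[i]` is the connecting word that stood between
    phrases[i] and phrases[i + 1]. `fits_context` is a hypothetical
    callable judging whether a rebuilt sentence matches the file's
    semantic scene.
    """
    library = CORRECTION_LIBRARIES[file_type]
    corrected = list(phrases)
    for i, phrase in enumerate(phrases):
        if phrase in library:
            continue  # a consistent preset phrase exists
        candidate = best_candidate(phrase, library)
        # Rebuild the local sentence from the neighbours and the candidate.
        lo, hi = max(i - 1, 0), min(i + 1, len(phrases) - 1)
        parts = []
        for j in range(lo, hi + 1):
            parts.append(candidate if j == i else phrases[j])
            if j < hi:
                parts.append(connectives[j])
        if fits_context(" ".join(parts)):
            corrected[i] = candidate
    return corrected
```

The replacement is applied only when the context check accepts the rebuilt sentence, mirroring the order of steps S431 and S432; otherwise the original phrase is left untouched.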
Further, in another embodiment of the text recognition error correction method of the present invention, the step of determining the target file type of the editable file according to the keyword group includes:
Step S31, comparing the keyword group with a preset keyword group library, and determining a target keyword group in the preset keyword group library, wherein the element matching rate between the target keyword group and the keyword group is the highest;
It can be appreciated that documents in different industry fields have different keywords. In order to determine the target file type of the editable file according to the keyword group, this embodiment provides a preset keyword group library, which is set in advance and contains the keyword groups of a plurality of industry fields. For example, the preset keyword group library [A, B, C] contains three keyword groups A, B and C, where keyword group A includes keywords a1, a2, a3, b1 and c1, keyword group B includes keywords a1, b2, b3 and c1, and keyword group C includes keywords a1, b1, c2 and c3. After the plurality of keywords have been formed into a keyword group, the keyword group is compared with the preset keyword group library to determine the target keyword group, that is, the keyword group in the library whose element matching rate with the keyword group is the highest. In practice, the highest element matching rate means the largest number of matching keywords between the two groups. Each keyword in the keyword group is compared with the keywords of each keyword group in the preset keyword group library to find the keyword group that shares the most keywords with it. If the keywords forming the keyword group are a1, a2, b1 and d1, keyword group A shares the most keywords with it (three, against one for B and two for C), so keyword group A is determined to be the target keyword group, and its element matching rate with the keyword group of the editable file is the highest.
Step S32, determining a target file type corresponding to the target keyword group according to the mapping relation between the keyword group and the file type in the preset keyword group library, and determining the corresponding target file type as the target file type of the editable file.
Further, a mapping relation between each keyword group and a file type is set in the preset keyword group library. For example, the preset keyword group library contains three keyword groups A, B and C, which map to file types a, b and c respectively. Therefore, once the target keyword group in the preset keyword group library has been determined, the corresponding target file type can be determined from the mapping relation in the library. If keyword group A is determined to be the target keyword group, the file type mapped by keyword group A in the preset keyword group library is a, so a is determined to be the target file type corresponding to the target keyword group. The target file type corresponding to the target keyword group is the target file type of the editable file, so that the target file type of the editable file is determined according to the keyword group.
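Steps S31 and S32 can be illustrated with the example values given above. This is a minimal sketch: the element matching rate is taken to be the number of shared keywords, as the text describes, and ties are broken by dictionary order, which is an assumption the text does not address.

```python
# Preset keyword group library and mapping, using the example values from the text.
PRESET_KEYWORD_GROUPS = {
    "A": {"a1", "a2", "a3", "b1", "c1"},
    "B": {"a1", "b2", "b3", "c1"},
    "C": {"a1", "b1", "c2", "c3"},
}
GROUP_TO_FILE_TYPE = {"A": "a", "B": "b", "C": "c"}


def target_keyword_group(keywords):
    """Step S31: pick the preset group sharing the most keywords with `keywords`.

    Assumption: on a tie, the first group in dictionary order wins.
    """
    kw = set(keywords)
    return max(PRESET_KEYWORD_GROUPS, key=lambda name: len(kw & PRESET_KEYWORD_GROUPS[name]))


def target_file_type(keywords):
    """Step S32: map the target keyword group to its file type."""
    return GROUP_TO_FILE_TYPE[target_keyword_group(keywords)]


print(target_file_type(["a1", "a2", "b1", "d1"]))  # prints "a"
```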
Further, in another embodiment of the text recognition error correction method of the present invention, the step of performing attribute conversion on the file to be corrected, and generating an editable file includes:
Step S21, scanning the file to be corrected, and determining a title and a paragraph in the file to be corrected according to the size relation and the interval relation among the characters in the file to be corrected;
Further, a file typically contains titles and paragraphs, and the text size and text spacing of titles and paragraphs differ: the text of a title is larger than the text of a paragraph, and the spacing between a title and a paragraph is larger than the spacing within the title or within the paragraph. When attribute conversion is performed on the file to be corrected, the file is scanned, and the titles and paragraphs in it are determined according to the size relation and the interval relation among the characters obtained by scanning. Specifically, when the scanned characters become smaller or the interval between characters becomes larger, the content scanned before the change is judged to be a title; when the characters change from small to large and then back to small, with the interval between characters changing correspondingly, the scan is judged to have passed from a paragraph to a title and then to another paragraph. Titles and paragraphs in the file to be corrected are thus distinguished by the changes in the size of the scanned characters and in the intervals between them.
Step S22, scanning the title and the characters in the paragraphs one by one, identifying the scanned characters according to a preset character library, and adding a title identifier to the identified title characters;
In order to convert the read-only file to be corrected into an editable file, this embodiment provides a preset character library, which is set in advance and contains various characters. After the titles and paragraphs in the file to be corrected have been determined, the characters in the titles and paragraphs are scanned one by one, and each scanned character is compared with the characters in the preset character library in order to identify it. A title identifier is added to the identified title characters to distinguish them from the paragraph characters.
Step S23, the identified title characters and paragraph characters are transmitted to a preset editor, and the editable file is generated.
Further, after the scanned title characters and paragraph characters have been identified according to the preset character library, they are transmitted to a preset editor. The preset editor is a text editing tool set in advance, such as one that produces Word documents or WPS documents; once the identified characters have been transmitted to the preset editor, they can be edited, and the editable file is thereby generated.
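The title detection and title identifier of steps S21 and S22 might be sketched as below. The sketch makes several assumptions: the scan result is represented as a list of `(text, char_size)` lines, the most common character size is taken to be the paragraph size, only the size relation is used (the interval relation is omitted for brevity), and the marker `#TITLE#` is a hypothetical title identifier.

```python
from collections import Counter

TITLE_ID = "#TITLE#"  # hypothetical title identifier


def tag_titles(lines):
    """Mark lines whose characters are larger than the paragraph size as titles.

    `lines` is a list of (text, char_size) tuples, an assumed representation
    of the scan result. Assumption: the most frequent size is the body size.
    """
    if not lines:
        return []
    body_size = Counter(size for _, size in lines).most_common(1)[0][0]
    return [TITLE_ID + text if size > body_size else text for text, size in lines]


scanned = [
    ("Labor Contract", 18),
    ("Party A agrees to employ Party B.", 12),
    ("Party B agrees to the terms below.", 12),
]
print(tag_titles(scanned))
```

Because the identifier travels with the title text, later steps can recover the title phrases without re-examining character sizes.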
Further, in another embodiment of the text recognition error correction method of the present invention, the step of reading a plurality of keywords in the editable file to form a keyword group includes:
Step S33, reading the phrases in the editable file, counting the occurrence frequency of each phrase, and taking the phrases whose frequency is greater than a preset value as the keywords;
Further, in order for the keyword group formed by the plurality of keywords to represent the industry field to which the editable file belongs, the keywords that are read should be the phrases that occur most often in the editable file, so that the industry field of those frequently occurring phrases determines the type of the editable file. The phrases in the editable file are therefore read and the occurrence frequency of each phrase is counted; the higher the frequency, the better the phrase reflects the type of the editable file. In addition, general connecting words are common to every industry field and cannot reflect the type of the editable file, so connecting words are excluded from the statistics. To reflect the type of the editable file more accurately, a preset value is set, and a phrase is taken as a keyword only when its occurrence frequency is greater than the preset value, so that the type of the editable file is reflected by the industry field of the frequently occurring phrases.
Step S34, acquiring the phrases in the title according to the title identifier, and forming the keyword group from the phrases in the title together with the keywords.
Understandably, the title content or the title type in a document can reflect the document type. For example, the title content "labor contract" indicates that the document belongs to the legal industry, and title types such as "claims" and "specification" indicate that the document belongs to the patent industry. Therefore, after the phrases in the editable file whose frequency is greater than the preset value have been taken as keywords, the phrases in the title are also acquired. Because a title identifier has been added to the title characters, the title characters can be located by the title identifier and the phrases in them can be acquired. The phrases in the title and the keywords are combined to form the keyword group, so that the type of the editable file is reflected more accurately.
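Steps S33 and S34 can be sketched as a frequency count followed by a union with the title phrases. The sketch assumes the file has already been split into phrases; the list of connecting words and the sample values are illustrative assumptions.

```python
from collections import Counter

# Assumption: an illustrative list of connecting words to exclude from the count.
CONNECTING_WORDS = {"and", "or", "the", "of"}


def build_keyword_group(body_phrases, title_phrases, preset_value):
    """Step S33: keep phrases whose frequency exceeds `preset_value`,
    ignoring connecting words. Step S34: add the phrases from the title."""
    counts = Counter(p for p in body_phrases if p not in CONNECTING_WORDS)
    keywords = {phrase for phrase, n in counts.items() if n > preset_value}
    return keywords | set(title_phrases)


body = ["contract", "and", "contract", "wage", "and", "and", "contract", "wage"]
print(build_keyword_group(body, ["labor contract"], preset_value=1))
```

Raising `preset_value` narrows the group to the most frequent phrases, while the title phrases are always included.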
Further, referring to fig. 2, a second embodiment of the text recognition error correction method according to the present invention is provided based on the first embodiment of the text recognition error correction method according to the present invention, in the second embodiment, the step of calling the target error correction library to correct the error of the editable file includes:
Step S50, outputting the editable file that has undergone error correction, and, when a correction operation on the outputted editable file is received, transmitting the correction words corresponding to the correction operation to the target error correction library so as to update the target error correction library.
Further, after the target error correction library has been called to correct the wrongly written characters in the editable file, the corrected editable file is output to the display interface of a terminal connected to the system server. A monitoring person responsible for checking the error correction result examines the editable file shown on the display interface and verifies that the result is correct. If the error correction result is correct, the error correction function of the target error correction library is suitable for the current editable file; if the result is incorrect, the error correction function of the target error correction library is not suitable for the current editable file, and the target error correction library needs to be updated. For the parts where the error correction result is incorrect, the monitoring person can perform a correction operation and input the correct correction words to fix those parts of the editable file. When the correction operation on the outputted editable file is received, the correction words corresponding to the correction operation are transmitted to the target error correction library, and the target error correction library is updated. In subsequent error correction of editable files through the target error correction library, the updated library, which now contains the correct correction words, is called, so that error correction accuracy improves through repeated rounds of correction.
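The update in step S50 can be illustrated with a very small sketch. It assumes the target error correction library is held as a set of preset phrases and that the correction operation yields a list of correction words; both representations are assumptions made for illustration.

```python
def update_library(library, correction_words):
    """Add the reviewer's correction words to the target error correction
    library (assumed to be a set) and return the words that were new."""
    added = [word for word in correction_words if word not in library]
    library.update(added)
    return added


target_library = {"labor contract"}
new_words = update_library(target_library, ["severance pay", "labor contract"])
print(new_words, sorted(target_library))
```

On the next correction pass the enlarged library is used, so a phrase the reviewer had to fix by hand becomes a candidate preset phrase.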
In addition, referring to fig. 3, the present invention provides a word recognition and error correction device, in a first embodiment of the word recognition and error correction device of the present invention, the word recognition and error correction device includes:
the reading module 10 is used for reading the extension name of the file to be corrected when the file to be corrected is received, and determining the attribute of the file to be corrected according to the extension name;
the judging module 20 is configured to judge whether the attribute of the file to be corrected is a read-only file, and if the attribute of the file to be corrected is a read-only file, perform attribute conversion on the file to be corrected to generate an editable file;
the determining module 30 is configured to read a plurality of keywords in the editable file, form a keyword group, and determine a target file type of the editable file according to the keyword group;
the error correction module 40 is configured to determine a target error correction library corresponding to the target file type according to a preset mapping relationship between the file type of the editable file and the error correction library, and call the target error correction library to correct the editable file.
In the character recognition error correction device of this embodiment, when a file to be corrected is received, the reading module 10 reads the extension of the file and determines the attribute of the file according to the extension; the judging module 20 judges whether the attribute of the file is read-only and, if it is, performs attribute conversion on the file to generate an editable file; the determining module 30 reads a plurality of keywords in the editable file to form a keyword group and determines the target file type of the editable file according to the keyword group; and the error correction module 40 determines the target error correction library corresponding to the target file type according to the preset mapping relation between file types and error correction libraries, and calls the target error correction library to correct the editable file. This scheme can recognize and correct both read-only and non-read-only files: a read-only file is first converted into an editable file, its file type is determined from the keyword group in the editable file, and the error correction library corresponding to that file type is called. Because files of different types contain phrases specific to different industries, a separate error correction library is set for each file type. Correcting with the library that corresponds to the file type makes the error correction more accurate, avoids manual error correction, and improves error correction efficiency.
The virtual function modules of the text recognition and error correction apparatus are stored in the memory 1005 of the text recognition and error correction device shown in fig. 4, and when the processor 1001 executes the text recognition and error correction program, the functions of the modules in the embodiment shown in fig. 3 are implemented.
Referring to fig. 4, fig. 4 is a schematic device structure of a hardware running environment related to a method according to an embodiment of the present invention.
The character recognition and error correction device in the embodiment of the invention can be a PC (personal computer) or a terminal device such as a smart phone, a tablet computer, an e-book reader or a portable computer.
As shown in fig. 4, the character recognition error correction device may include: a processor 1001, such as a CPU (Central Processing Unit), a memory 1005, and a communication bus 1002. The communication bus 1002 is used to enable connection and communication between the processor 1001 and the memory 1005. The memory 1005 may be a high-speed RAM (random access memory) or a non-volatile memory, such as a disk memory. The memory 1005 may also optionally be a storage device separate from the processor 1001 described above.
Optionally, the character recognition error correction device may further include a user interface, a network interface, a camera, an RF (Radio Frequency) circuit, a sensor, an audio circuit, a WiFi (Wireless Fidelity) module, and the like. The user interface may comprise a display and an input unit such as a keyboard, and may optionally further comprise a standard wired interface and a wireless interface. The network interface may optionally include a standard wired interface and a wireless interface (e.g., a WiFi interface).
It will be appreciated by those skilled in the art that the word recognition error correction apparatus structure shown in fig. 4 is not limiting of the word recognition error correction apparatus, and may include more or fewer components than shown, or may combine certain components, or may be a different arrangement of components.
As shown in fig. 4, an operating system, a network communication module, and a word recognition error correction program may be included in the memory 1005, which is a type of computer storage medium. The operating system is a program that manages and controls the hardware and software resources of the word recognition error correction device, supporting the execution of word recognition error correction programs and other software and/or programs. The network communication module is used to implement communication between components within the memory 1005 and other hardware and software in the word recognition and error correction device.
In the word recognition and error correction apparatus shown in fig. 4, a processor 1001 is configured to execute a word recognition and error correction program stored in a memory 1005, and implement the steps in the embodiments of the word recognition and error correction method described above.
The present invention provides a computer-readable storage medium storing one or more programs that are further executable by one or more processors for implementing the steps in the above-described embodiments of a word recognition error correction method.
It should also be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or by means of hardware, although in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) as described above, comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present invention.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the invention, and all equivalent structural changes made by the specification and drawings of the present invention or direct/indirect application in other related technical fields are included in the scope of the present invention.

Claims (8)

1. A character recognition error correction method, characterized by comprising the following steps:
when a file to be corrected is received, reading an extension name of the file to be corrected, and determining the attribute of the file to be corrected according to the extension name;
judging whether the attribute of the file to be corrected is a read-only file or not, if so, performing attribute conversion on the file to be corrected to generate an editable file;
reading a plurality of keywords in the editable file to form a keyword group, and determining the target file type of the editable file according to the keyword group;
determining a target error correction library corresponding to the type of the target file according to the preset mapping relation between the type of each editable file and the error correction library, and calling the target error correction library to correct the editable file;
the step of converting the attribute of the file to be corrected to generate the editable file comprises the following steps:
Scanning the file to be corrected, and determining a title and a paragraph in the file to be corrected according to the size relation and the interval relation among the characters in the file to be corrected;
scanning the characters in the title and the paragraph one by one, identifying the scanned characters according to a preset character library, and adding a title identifier to the identified title characters;
transmitting the identified title characters and paragraph characters to a preset editor to generate an editable file;
the step of reading a plurality of keywords in the editable file to form a keyword group comprises the following steps:
reading the word groups in the editable file, counting the occurrence frequency of each word group, and taking the word groups with the frequency larger than a preset value as key words;
and acquiring the phrases in the title according to the title identifier, and forming the phrases in the title and the keywords into the keyword group, wherein the title content or the title type in the file reflects the file type.
2. The word recognition error correction method of claim 1, wherein the step of invoking the target error correction library to correct the editable file comprises:
recognizing sentences in the editable file, detecting connecting words in each sentence, and dividing the sentences into a plurality of phrases to be recognized according to the connecting words;
Comparing the phrases to be identified with each preset phrase in the target error correction library one by one, and judging whether preset phrases consistent with the phrases to be identified exist in the target error correction library or not;
if the preset phrase consistent with the phrase to be identified does not exist in the target error correction library, acquiring the target preset phrase with the highest similarity with the phrase to be identified in the target error correction library, and replacing the phrase to be identified with the target preset phrase.
3. The text recognition error correction method of claim 2, wherein the step of replacing the phrase to be recognized with the target preset phrase comprises:
acquiring a phrase to be recognized adjacent to the current phrase to be recognized, forming a sentence to be recognized from the adjacent phrase to be recognized and the target preset phrase, and judging, according to the sentence to be recognized, whether the target preset phrase matches the semantic context of the editable file;
if the target preset phrase is matched with the editable file, the target preset phrase is used for replacing the phrase to be identified.
4. The text recognition and correction method of claim 1, wherein the step of determining the target file type of the editable file based on the keyword group includes:
comparing the key phrase with a preset key phrase library, and determining a target key phrase in the preset key phrase library, wherein the element matching rate of the target key phrase and the key phrase is highest;
According to the mapping relation between the key word groups and the file types in the preset key word group library, determining the target file type corresponding to the target key word groups, and determining the corresponding target file type as the target file type of the editable file.
5. The word recognition error correction method of any one of claims 1-4, wherein the step of invoking the target error correction library to correct the editable file comprises:
outputting the edited file subjected to error correction, and transmitting correction words corresponding to the correction operation to a target error correction library when the correction operation of the outputted edited file is received, so as to update the target error correction library.
6. A word recognition and error correction apparatus, comprising:
the reading module is used for reading the extension name of the file to be corrected when the file to be corrected is received, and determining the attribute of the file to be corrected according to the extension name;
the judging module is used for judging whether the attribute of the file to be corrected is a read-only file, and if the attribute of the file to be corrected is the read-only file, performing attribute conversion on the file to be corrected to generate an editable file;
the determining module is used for reading a plurality of keywords in the editable file to form a keyword group and determining the target file type of the editable file according to the keyword group;
The error correction module is used for determining a target error correction library corresponding to the type of the target file according to the preset mapping relation between the type of each editable file and the error correction library, and calling the target error correction library to correct the editable file;
the judging module is used for realizing: scanning the file to be corrected, and determining a title and a paragraph in the file to be corrected according to the size relation and the interval relation among the characters in the file to be corrected; scanning the characters in the title and the paragraph one by one, identifying the scanned characters according to a preset character library, and adding a title identifier to the identified title characters; transmitting the identified title characters and paragraph characters to a preset editor to generate an editable file;
the judging module is further configured to implement: reading the word groups in the editable file, counting the occurrence frequency of each word group, and taking the word groups with the frequency larger than a preset value as key words; and acquiring the phrase in the title according to the title identifier, and forming the phrase in the title and the keyword into a keyword phrase, wherein the title content or the title type in the file reflects the file type.
7. A word recognition and error correction apparatus, comprising: a memory, a processor, a communication bus, and a word recognition error correction program stored on the memory;
The communication bus is used for realizing connection communication between the processor and the memory;
the processor is configured to execute the word recognition error correction program to implement the steps of the word recognition error correction method as claimed in any one of claims 1-5.
8. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a text recognition and error correction program, which when executed by a processor, implements the steps of the text recognition and error correction method according to any of claims 1-5.
CN201810430989.8A 2018-05-07 2018-05-07 Character recognition error correction method, device, equipment and computer readable storage medium Active CN108664471B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810430989.8A CN108664471B (en) 2018-05-07 2018-05-07 Character recognition error correction method, device, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN108664471A CN108664471A (en) 2018-10-16
CN108664471B true CN108664471B (en) 2024-01-23

Family

ID=63778807

Country Status (1)

Country Link
CN (1) CN108664471B (en)

Also Published As

Publication number Publication date
CN108664471A (en) 2018-10-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20231229

Address after: Room 1104, 11th Floor, Building 16, No. 6 Wenhuayuan West Road, Beijing Economic and Technological Development Zone, Daxing District, Beijing, 100000

Applicant after: Beijing Yiyin Technology Co.,Ltd.

Address before: 518000 Room 202, block B, aerospace micromotor building, No.7, Langshan No.2 Road, Xili street, Nanshan District, Shenzhen City, Guangdong Province

Applicant before: Shenzhen Lian Intellectual Property Service Center

Effective date of registration: 20231229

Address after: 518000 Room 202, block B, aerospace micromotor building, No.7, Langshan No.2 Road, Xili street, Nanshan District, Shenzhen City, Guangdong Province

Applicant after: Shenzhen Lian Intellectual Property Service Center

Address before: 518000 Room 201, building A, No. 1, Qian Wan Road, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong (Shenzhen Qianhai business secretary Co., Ltd.)

Applicant before: PING AN PUHUI ENTERPRISE MANAGEMENT Co.,Ltd.

TA01 Transfer of patent application right
GR01 Patent grant