CN113011406A - Single-template working flow optimization method - Google Patents

Publication number
CN113011406A
Authority
CN
China
Prior art keywords
word
text
error
corrected
template
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110312418.6A
Other languages
Chinese (zh)
Inventor
玄洪升
李明明
潘心冰
郭保荣
冷静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Cloud Information Technology Co Ltd
Original Assignee
Inspur Cloud Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Cloud Information Technology Co Ltd
Priority to CN202110312418.6A
Publication of CN113011406A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/98 Detection or correction of errors, e.g. by rescanning the pattern or by human intervention; Evaluation of the quality of the acquired patterns
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Abstract

The invention relates to the field of deep learning and image character recognition, and in particular provides a single-template workflow optimization method comprising the following steps: S1, constructing an N-Gram knowledge base for the template picture domain; S2, preprocessing the OCR text to be corrected into a sequence to be corrected; S3, identifying suspected erroneous words; S4, constructing a list of correct candidate words based on an error rule reasoning model; S5, generating a comprehensive weight for each candidate with a candidate character screening algorithm and ranking the weights, the character with the highest weight being taken as the correct character for the erroneous character in the current text to be corrected. Compared with the prior art, the method improves the usability of the text obtained by OCR conversion of images in the single-template workflow. Applied in the stages of framing reference fields, framing recognition areas, and evaluation, it improves the precision of the character recognition model of the single-template workflow and safeguards the extraction of structured information.

Description

Single-template working flow optimization method
Technical Field
The invention relates to the field of deep learning and image character recognition, and particularly provides a single-template workflow optimization method.
Background
A single-template workflow lets users build their own character recognition template, recognizes the characters in a template picture, and provides a high-precision character recognition model that ensures accurate extraction of structured information. The usability of the text recognized from the image by OCR is a key indicator for the workflow's stages of framing reference fields, framing recognition areas, and evaluation, as shown in FIG. 4.
Text is extracted from pictures mainly by Optical Character Recognition (OCR). A well-optimized OCR engine recognizes most text correctly, but recognition errors still occur in part of the text.
Text error correction has developed alongside computer technology, and many researchers in China and abroad are working on it. It is usually divided into two steps: error detection followed by error correction. In the mainstream approach, error detection first segments the sentence with the Jieba Chinese word segmenter. Because the sentence contains wrongly written characters, the segmentation itself is often wrong, so errors are detected at both character granularity and word granularity, and the suspected errors at the two granularities are merged into a candidate set of suspected error positions.
Error correction then traverses all suspected error positions, replaces the words at those positions using similar-character dictionaries, computes the perplexity of each resulting sentence with a language model, and compares and ranks the results over all candidates to obtain the best correction.
Against this background, improving the usability of the text obtained from images by OCR has become one of the bottlenecks of the single-template workflow.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a highly practical single-template workflow optimization method.
The technical solution adopted by the invention to solve the technical problem is as follows:
A single-template workflow optimization method comprises the following steps:
S1, constructing an N-Gram knowledge base for the template picture domain;
S2, preprocessing the OCR text to be corrected into a sequence to be corrected;
S3, identifying suspected erroneous words;
S4, constructing a list of correct candidate words based on an error rule reasoning model;
S5, generating a comprehensive weight for each candidate with a candidate character screening algorithm and ranking the weights, the character with the highest weight being the correct character for the erroneous character in the current text to be corrected.
Further, in step S1, BiGram segmentation is performed on the collected template picture domain text set, the word collocations in the domain text set are analyzed and counted, a template picture domain word collocation probability matrix is generated, and a template picture domain corpus N-Gram knowledge base is constructed. This knowledge base serves as the background knowledge base of the text error judgment mechanism and is used to locate errors in the text to be corrected.
Preferably, in step S2, the text to be corrected is preprocessed, including removing the numbers, symbols and English letters in each sentence, and the sentence is BiGram-segmented with the N-Gram mechanism to obtain the text sequence to be corrected.
Further, in step S3, the BiGram-segmented word sequence obtained by preprocessing the text to be corrected is compared against the domain word collocation probability matrix in the N-Gram knowledge base of the template picture domain text set. If the probability of a BiGram group is lower than the threshold, a text error is judged to exist in that BiGram group.
Further, in step S4, after the error position of a word in the text to be corrected has been determined, an error rule inference model is used to infer the OCR conversion error rules that apply to the word.
Further, in step S4, the Chinese character structure attribute knowledge graph consists mainly of the Chinese characters themselves, the shape code corresponding to each character, the strokes of each character, and the error rules that occur between characters during OCR conversion. The graph assigns a unique number to each Chinese character, and the character numbers serve as the main nodes of all relations, forming a complete Chinese character structure attribute knowledge graph.
Further, in step S4, a correct-word candidate interval for the erroneous word is generated from the prediction of how the word was corrupted during OCR conversion, for extraction by the subsequent candidate word screening algorithm, and a candidate word list is generated from the inference result for the erroneous word in the OCR conversion error rule inference model.
Further, in step S5, the candidate word list is screened in two parts.
In the first part, for each candidate word in the list, the word collocations it forms with the BiGram-segmented context of the suspected erroneous word in the text to be corrected are checked against the BiGram domain word collocation knowledge base of the template pictures. A candidate that cannot form a reasonable sentence with the context before and after the suspected error position is excluded directly and is not kept in the candidate word list for the weight calculation of the screening algorithm.
Further, in step S5, the screening algorithm for the candidate word list uses the TF-IDF algorithm and cosine similarity to score each candidate word against the context of the suspected erroneous word; the score represents how reasonable a sentence the candidate forms with the context of the error position in the text to be corrected. The degrees of similarity between the erroneous word in the text to be corrected and the candidate words that remain after comparison against the BiGram template picture knowledge base are then compared, and the most reasonable candidate word is selected.
Compared with the prior art, the single-template workflow optimization method has the following outstanding beneficial effects:
The method improves the usability of the text obtained by OCR conversion of images in the single-template workflow. Applied in the stages of framing reference fields, framing recognition areas, and evaluation, it improves the precision of the character recognition model of the single-template workflow and safeguards the extraction of structured information.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a diagram of a text error correction framework in a single-template workflow optimization method;
FIG. 2 is a flow chart of an error determination mechanism in a single-template workflow optimization method;
FIG. 3 is a diagram of an error rule reasoning model in a single-template workflow optimization method;
FIG. 4 is a flow chart of a single-template workflow in the prior art.
Detailed Description
The present invention will be described in further detail with reference to specific embodiments in order to better understand the technical solutions of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A preferred embodiment is given below:
As shown in FIG. 1, the single-template workflow optimization method in this embodiment includes the following steps:
S1, constructing an N-Gram knowledge base for the template picture domain:
First, BiGram segmentation is performed on the collected template picture domain text set, the word collocations in the domain text set are analyzed and counted, a template picture domain word collocation probability matrix is generated, and a template picture domain text set N-Gram knowledge base is constructed. This knowledge base serves as the background knowledge base of the text error judgment mechanism and is used to locate errors in the text to be corrected.
BiGram (a gram of two adjacent units) is a segmentation granularity widely used in N-Gram techniques. Compared with TriGram, 5-Gram and so on, BiGram gives finer granularity for the statistical analysis of word collocations in text, yields a more accurate domain word collocation probability matrix, and provides higher precision when locating text errors.
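As a rough illustration of step S1, the sketch below builds a collocation probability matrix from a tiny corpus. It assumes character-level BiGrams and a plain maximum-likelihood estimate P(second | first); the function names and the three sample strings are hypothetical and do not come from the patent, which does not specify how the probabilities are estimated.

```python
from collections import Counter, defaultdict


def char_bigrams(sentence):
    """Split a sentence into overlapping two-character BiGrams."""
    return [sentence[i:i + 2] for i in range(len(sentence) - 1)]


def build_bigram_knowledge_base(corpus):
    """Estimate P(second | first) from raw BiGram counts (assumed estimator)."""
    pair_counts = Counter()
    first_counts = Counter()
    for sentence in corpus:
        for first, second in char_bigrams(sentence):
            pair_counts[(first, second)] += 1
            first_counts[first] += 1
    matrix = defaultdict(dict)
    for (first, second), count in pair_counts.items():
        matrix[first][second] = count / first_counts[first]
    return matrix


# Hypothetical domain corpus (invoice field labels), for illustration only.
corpus = ["发票号码", "发票代码", "开票日期"]
kb = build_bigram_knowledge_base(corpus)
```

A nested dictionary keeps the matrix sparse, which matters because most character pairs never co-occur in a narrow template domain.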
S2, preprocessing the OCR text to be corrected into a sequence to be corrected:
The text to be corrected is preprocessed, including removing the numbers, symbols, English letters and the like in each sentence, and the sentence is BiGram-segmented with the N-Gram mechanism to obtain the text sequence to be corrected.
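A minimal sketch of this preprocessing step follows. It assumes that "numbers, symbols, English letters and the like" can be approximated by dropping everything outside the CJK Unified Ideographs range U+4E00 to U+9FFF; that range and the function name are assumptions made for illustration.

```python
import re

# Assumption: keep only CJK Unified Ideographs; digits, punctuation,
# whitespace and Latin letters are all removed.
_NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]")


def preprocess(text):
    """Strip non-Chinese characters, then return overlapping BiGrams."""
    cleaned = _NON_CHINESE.sub("", text)
    return [cleaned[i:i + 2] for i in range(len(cleaned) - 1)]


sequence = preprocess("发票No.123号码:")
```

Note that removing characters joins their neighbours, so a BiGram such as 票号 here spans a gap in the original string; a production pipeline might prefer to split the sentence at removed spans instead.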
S3, identifying suspected erroneous words:
The BiGram-segmented word sequence obtained by preprocessing the text to be corrected is compared against the domain word collocation probability matrix in the N-Gram knowledge base of the template picture domain text set. If the probability of a BiGram group is lower than the threshold, a text error is judged to exist in that BiGram group.
As shown in FIG. 2, W1, W2, W3 and W4 represent four words in the N-Gram sequence of the text to be corrected. BiGram segmentation yields three BiGram groups: W1W2, W2W3 and W3W4. Word collocation probability analysis is performed on each group, and when the collocation probability of W2W3 is lower than the threshold, a text error is judged to exist in the W2W3 combination. Collocation probability analysis is then performed on the neighbouring BiGram combinations W1W2 (containing W2) and W3W4 (containing W3). The combination with the higher probability is regarded as correct and the one with the lower probability as erroneous, which pins down the position of the erroneous word within the W2W3 combination.
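The W1 to W4 procedure can be sketched as below. The knowledge base is a hand-written nested dictionary standing in for the S1 matrix, and the threshold of 0.2, the handling of sentence boundaries (a missing neighbour scores 0.0) and the tie-breaking rule are all assumptions, since the patent leaves them unspecified.

```python
def bigram_prob(kb, first, second):
    """Look up P(second | first); unseen pairs score 0.0."""
    return kb.get(first, {}).get(second, 0.0)


def locate_suspects(chars, kb, threshold):
    """Return indices of characters judged erroneous by the BiGram check."""
    suspects = []
    for i in range(len(chars) - 1):
        if bigram_prob(kb, chars[i], chars[i + 1]) >= threshold:
            continue  # this W2W3 pair looks fine
        # Compare the neighbouring pairs W1W2 (left) and W3W4 (right).
        left = bigram_prob(kb, chars[i - 1], chars[i]) if i > 0 else 0.0
        right = (bigram_prob(kb, chars[i + 1], chars[i + 2])
                 if i + 2 < len(chars) else 0.0)
        # The side with the weaker neighbour pair holds the suspect;
        # ties blame the right-hand character (an arbitrary choice).
        suspect = i if left < right else i + 1
        if suspect not in suspects:
            suspects.append(suspect)
    return suspects


# Hypothetical collocation probabilities, for illustration only.
kb = {"发": {"票": 1.0}, "票": {"号": 0.5, "代": 0.5}, "号": {"码": 1.0}}
end_error = locate_suspects(list("发票号吗"), kb, threshold=0.2)
mid_error = locate_suspects(list("发栗号码"), kb, threshold=0.2)
```

In the second example both BiGrams that contain 栗 fall below the threshold, and both comparisons point at the same index, so the position is reported once.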
S4, constructing a list of correct candidate words based on the error rule reasoning model:
As shown in FIG. 3, after the error position of a word in the text to be corrected has been determined, an error rule inference model is used to infer the OCR conversion error rules that apply to the word.
The Chinese character structure attribute knowledge graph consists mainly of the Chinese characters themselves, the shape code corresponding to each character, the strokes of each character, and the error rules that occur between characters during OCR conversion. Each Chinese character is assigned a unique number in the graph, and the character numbers serve as the main nodes of all relations, forming the complete Chinese character structure attribute knowledge graph.
A correct-character candidate interval for the erroneous character is generated from the prediction of how the character was corrupted during OCR conversion, for extraction by the subsequent candidate word screening algorithm. A candidate word list is then generated from the inference result for the erroneous word in the OCR conversion error rule inference model.
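One way to picture the knowledge graph and the candidate lookup is the toy structure below. The node fields mirror the attributes the description lists (character, shape code, strokes, OCR error rules), but the class, the placeholder shape codes, and the confusion edges are invented for illustration; the patent does not disclose its actual rules or encoding.

```python
from dataclasses import dataclass, field


@dataclass
class HanziNode:
    char: str
    shape_code: str   # placeholder value, not a real shape-code entry
    strokes: int
    # Numbers of characters that OCR is assumed to confuse with this one.
    confused_with: set = field(default_factory=set)


# Each character gets a unique number, which is the key of its main node.
graph = {
    1: HanziNode("码", "code-1", 8, {2, 3}),
    2: HanziNode("吗", "code-2", 6, {1, 3, 4}),
    3: HanziNode("玛", "code-3", 7, {1, 2}),
    4: HanziNode("妈", "code-4", 6, {2}),
}
char_to_id = {node.char: number for number, node in graph.items()}


def candidate_list(wrong_char, graph, char_to_id):
    """Follow the OCR error-rule edges out of the erroneous character."""
    number = char_to_id.get(wrong_char)
    if number is None:
        return []  # character not in the graph: no candidates
    return [graph[n].char for n in sorted(graph[number].confused_with)]


candidates = candidate_list("吗", graph, char_to_id)
```

A real graph would likely also derive candidates from shared shape codes or stroke differences rather than from explicit edges alone.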
S5, generating a comprehensive weight for each candidate with a candidate character screening algorithm and ranking the weights, the character with the highest weight being the correct character for the erroneous character in the current text to be corrected:
The candidate word list is screened in two parts.
First, for each candidate word in the list, the word collocations it forms with the BiGram-segmented context of the suspected erroneous word in the text to be corrected are checked against the BiGram domain word collocation knowledge base of the template pictures. A candidate that cannot form a reasonable sentence with the context before and after the suspected error position is excluded directly and is not kept in the candidate word list for the weight calculation of the screening algorithm.
Then, the screening algorithm for the candidate word list uses the TF-IDF algorithm and cosine similarity to score each candidate character against the context of the suspected erroneous character; the score represents how reasonable a sentence the candidate forms with the context of the error position in the text to be corrected. The degrees of similarity between the erroneous character in the text to be corrected and the candidates that remain after comparison against the BiGram template picture knowledge base are then compared, and the most reasonable candidate character is selected.
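The first screening pass might look like the sketch below. It assumes that "forms a reasonable sentence" means both context BiGrams have non-zero probability in the knowledge base, with a missing neighbour at a sentence boundary counting as acceptable; the function name and the sample data are hypothetical.

```python
def filter_candidates(chars, pos, candidates, kb):
    """Keep candidates whose left and right BiGrams exist in the KB."""
    kept = []
    for cand in candidates:
        left_ok = pos == 0 or kb.get(chars[pos - 1], {}).get(cand, 0.0) > 0.0
        right_ok = (pos == len(chars) - 1
                    or kb.get(cand, {}).get(chars[pos + 1], 0.0) > 0.0)
        if left_ok and right_ok:
            kept.append(cand)
    return kept


# Hypothetical collocation probabilities, for illustration only.
kb = {"发": {"票": 1.0}, "票": {"号": 0.5, "代": 0.5}, "号": {"码": 1.0}}
kept = filter_candidates(list("发票号吗"), 3, ["码", "玛", "妈"], kb)
```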
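The description does not say exactly which vectors are compared, so the following is only one plausible reading of the second pass: substitute each surviving candidate into the sentence, represent sentences as TF-IDF vectors over character BiGrams, and score a candidate by its best cosine similarity to any sentence in the domain corpus. The smoothed IDF formula, the use of the maximum, and all names are assumptions.

```python
import math
from collections import Counter


def bigrams(sentence):
    return [sentence[i:i + 2] for i in range(len(sentence) - 1)]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF dict per document, over character BiGrams."""
    tokenised = [bigrams(d) for d in docs]
    doc_freq = Counter(t for toks in tokenised for t in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenised:
        tf = Counter(toks)
        vectors.append({
            t: (c / len(toks)) * (math.log((1 + n) / (1 + doc_freq[t])) + 1)
            for t, c in tf.items()
        })
    return vectors


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(chars, pos, candidates, corpus):
    """Score each candidate sentence against a non-empty domain corpus."""
    scored = []
    for cand in candidates:
        sentence = "".join(chars[:pos] + [cand] + chars[pos + 1:])
        vectors = tfidf_vectors(corpus + [sentence])
        score = max(cosine(vectors[-1], v) for v in vectors[:-1])
        scored.append((score, cand))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


# Hypothetical domain corpus, for illustration only.
corpus = ["发票号码", "发票代码", "开票日期"]
ranked = rank_candidates(list("发票号吗"), 3, ["玛", "码"], corpus)
best = ranked[0][1]
```

The top-ranked entry plays the role of the character with the highest weight in step S5; combining this score with the BiGram probabilities into a single weight is left open here.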
The above embodiments are only specific embodiments of the present invention. The scope of protection of the present invention includes but is not limited to these embodiments, and any suitable change or substitution that is consistent with the claims of the single-template workflow optimization method of the present invention and is made by a person of ordinary skill in the art shall fall within the scope of protection of the present invention.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (9)

1. A single-template working flow optimization method is characterized by comprising the following steps:
S1, constructing an N-Gram knowledge base for the template picture domain;
S2, preprocessing the OCR text to be corrected into a sequence to be corrected;
S3, identifying suspected erroneous words;
S4, constructing a list of correct candidate words based on an error rule reasoning model;
S5, generating a comprehensive weight for each candidate with a candidate character screening algorithm and ranking the weights, the character with the highest weight being the correct character for the erroneous character in the current text to be corrected.
2. The method of claim 1, wherein in step S1, BiGram segmentation is performed on the collected template picture domain text set, word collocations in the domain text set are analyzed and counted, a template picture domain word collocation probability matrix is generated, and a template picture domain text set N-Gram knowledge base is constructed as the background knowledge base of a text error judgment mechanism for locating errors in the text to be corrected.
3. The method according to claim 1, wherein in step S2, the text to be corrected is preprocessed, including removing numbers, symbols and English letters in sentences, and the sentences are BiGram-segmented by using an N-Gram mechanism to obtain the text sequence to be corrected.
4. The method according to claim 1, wherein in step S3, the text to be corrected is pre-processed to obtain a BiGram segmented word sequence of the text to be corrected, and the BiGram segmented word sequence is compared with a domain word collocation probability matrix of the template picture domain text set N-Gram knowledge base, and if the probability is lower than a threshold value, it is determined that a text error exists in the BiGram group.
5. The single-template workflow optimization method of claim 1, wherein in step S4, after determining the error position of a word in the text to be corrected, an OCR transformation error rule inference model is used to perform OCR transformation error rule inference on the word.
6. The single-template workflow optimization method of claim 5, wherein in step S4, the Chinese character structure attribute knowledge graph consists mainly of the Chinese characters themselves, the corresponding shape codes of the Chinese characters, the strokes of the Chinese characters, and the error rules that occur between Chinese characters during OCR conversion, wherein the graph assigns a unique number to each Chinese character, and the Chinese character numbers are used as the main nodes of all relations to form a complete Chinese character structure attribute knowledge graph.
7. The single-template workflow optimization method of claim 6, wherein in step S4, a correct word candidate interval of the wrong word is generated according to a predicted result of the word with errors in OCR transformation, and is used for extraction by a subsequent candidate word screening algorithm, and a candidate word list is generated according to a reasoning result of the wrong word in an OCR transformation error rule reasoning model.
8. The single-template workflow optimization method according to claim 1, wherein in step S5, the screening of the candidate word list is divided into two parts,
wherein, for each candidate word in the candidate word list, the word collocations it forms with the BiGram-segmented context of the suspected erroneous word in the text to be corrected are checked against the BiGram domain word collocation knowledge base of the template pictures, and a candidate word that cannot form a reasonable sentence with the context before and after the suspected error position is excluded directly and is not kept in the candidate word list for the weight calculation of the screening algorithm.
9. The single-template workflow optimization method of claim 8, wherein in step S5, the screening algorithm for the candidate word list uses the TF-IDF algorithm and cosine similarity to score each candidate word against the context of the suspected erroneous word, the score representing how reasonable a sentence the candidate word forms with the context of the error position in the text to be corrected, compares the degrees of similarity between the erroneous word in the text to be corrected and the candidate words remaining after comparison against the BiGram template picture knowledge base, and selects the most reasonable candidate word.
CN202110312418.6A 2021-03-24 2021-03-24 Single-template working flow optimization method Pending CN113011406A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110312418.6A CN113011406A (en) 2021-03-24 2021-03-24 Single-template working flow optimization method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110312418.6A CN113011406A (en) 2021-03-24 2021-03-24 Single-template working flow optimization method

Publications (1)

Publication Number Publication Date
CN113011406A true CN113011406A (en) 2021-06-22

Family

ID=76405926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110312418.6A Pending CN113011406A (en) 2021-03-24 2021-03-24 Single-template working flow optimization method

Country Status (1)

Country Link
CN (1) CN113011406A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113327071A (en) * 2021-08-04 2021-08-31 深圳市深水水务咨询有限公司 5G-based environment management method and device, computer equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110866457A (en) * 2019-10-28 2020-03-06 世纪保众(北京)网络科技有限公司 Electronic insurance policy obtaining method and device, computer equipment and storage medium
CN111859921A (en) * 2020-07-08 2020-10-30 金蝶软件(中国)有限公司 Text error correction method and device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张笑文: "基于知识图谱的OCR转换文本纠错方法研究与应用", 《中国优秀博硕士学位论文全文数据库(硕士) 信息科技辑》 *

Similar Documents

Publication Publication Date Title
CN114610515B (en) Multi-feature log anomaly detection method and system based on log full semantics
KR100630886B1 (en) Character string identification
JP3950535B2 (en) Data processing method and apparatus
CN113806563B (en) Architect knowledge graph construction method for multi-source heterogeneous building humanistic historical material
CN111062397A (en) Intelligent bill processing system
CN110110334B (en) Remote consultation record text error correction method based on natural language processing
Reffle et al. Unsupervised profiling of OCRed historical documents
CN117076653B (en) Knowledge base question-answering method based on thinking chain and visual lifting context learning
CN109145287A (en) Indonesian word error-detection error-correction method and system
CN111460164A (en) Intelligent barrier judgment method for telecommunication work order based on pre-training language model
CN113987199A (en) BIM intelligent image examination method, system and medium with standard automatic interpretation
CN116306600A (en) MacBert-based Chinese text error correction method
CN113420766B (en) Low-resource language OCR method fusing language information
CN104572632A (en) Method for determining translation direction of word with proper noun translation
CN113011406A (en) Single-template working flow optimization method
US11301627B2 (en) Contextualized character recognition system
CN116757188A (en) Cross-language information retrieval training method based on alignment query entity pairs
JP6168057B2 (en) Failure occurrence cause extraction device, failure occurrence cause extraction method, and failure occurrence cause extraction program
CN110807096A (en) Information pair matching method and system on small sample set
KR101359039B1 (en) Analysis device and method for analysis of compound nouns
CN111428475B (en) Construction method of word segmentation word stock, word segmentation method, device and storage medium
CN114548075A (en) Text processing method, text processing device, storage medium and electronic equipment
Mohapatra et al. Spell checker for OCR
CN115688748A (en) Question error correction method and device, electronic equipment and storage medium
CN110472243A (en) A kind of Chinese spell checking methods

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210622