CN113269192B - OCR post-processing method based on word matching and grammar matching - Google Patents

Publication number
CN113269192B
CN113269192B CN202110567957.4A CN202110567957A CN113269192B CN 113269192 B CN113269192 B CN 113269192B CN 202110567957 A CN202110567957 A CN 202110567957A CN 113269192 B CN113269192 B CN 113269192B
Authority
CN
China
Prior art keywords
word
matching
grammar
recognition
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110567957.4A
Other languages
Chinese (zh)
Other versions
CN113269192A (en)
Inventor
薛翔天
孔祥龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN202110567957.4A
Publication of CN113269192A
Application granted
Publication of CN113269192B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses an OCR post-processing method based on word matching and grammar matching. Single-character recognition first yields, for each character, a set of the top K recognition results. For each text segment, the highest-probability recognition result of each character is taken as a preliminary sequence and segmented into words. Multi-character words produced by segmentation undergo a corpus-based word matching operation, and the word combination with the highest probability, taking into account the probabilities from the front-end recognition module, is selected to update the word. Single-character words produced by segmentation undergo a grammar matching operation: the K recognition results of the character are each checked by grammatical analysis, and the most probable result is used to update the character. The results of the two steps are fused to form the post-processing output. The invention makes full use of the syntactic information of the text, processes single-character and multi-character words separately, and adapts well; compared with conventional techniques based on word matching alone, it offers clearer advantages and higher application value, especially on lower-quality text.

Description

OCR post-processing method based on word matching and grammar matching
Technical Field
The invention relates to an OCR post-processing method based on word matching and grammar matching, belonging to the technical field of OCR processing.
Background
OCR (Optical Character Recognition) uses optical and computer technology to read text printed or written on paper and convert it into a format that a computer can accept and a person can understand. It is a relatively broad problem whose requirements, standards, and fault tolerance vary across specific scenarios. The OCR flow is generally divided into the following steps: text detection, text recognition, and post-processing. Post-processing is an important component of OCR because recognition errors caused by environmental noise or by visually similar characters are very common, and it is often desirable to correct them using corpus and context information. Classical solutions are of two types: 1) an improved BK-tree based on a prior dictionary; 2) an error correction mechanism based on a language model.
Current post-processing is aimed mainly at multi-character words, and few methods handle single-character words. To address this problem, the invention combines a word matching technique for multi-character words with a grammar matching technique for single-character words, so that OCR post-processing performs better.
Disclosure of Invention
The invention discloses an OCR (Optical Character Recognition) post-processing method based on word matching and grammar matching, which obtains the top K recognition results for each character through single-character recognition. For each text segment, the highest-probability recognition result of each character is taken as a preliminary sequence and segmented into words. Multi-character words produced by segmentation undergo a corpus-based word matching operation, and the word combination with the highest probability, taking into account the probabilities from the front-end recognition module, is selected to update the word. Single-character words produced by segmentation undergo a grammar matching operation: the K recognition results of the character are each checked by grammatical analysis, and the most probable result is used to update the character. The results of the two steps are fused to form the post-processing output.
To achieve the above object, the technical scheme of the invention is an OCR post-processing method based on word matching and grammar matching, comprising the following steps:
Step 1) front-end OCR single-character recognition: the front-end OCR module locates the text in the scene and recognizes each character, and the K most likely recognition results and their probabilities are stored;
Step 2) word segmentation: the highest-probability recognition result of each character is taken as the initial result, and a mainstream word segmentation tool segments the text sequence;
Step 3) forward maximum word matching based on a Chinese dictionary: for each multi-character word produced by segmentation, the recognized word is compared with its plausible similar candidate words, the most logical word is determined from the preceding and following recognized words, and the initial result is corrected;
Step 4) multi-corpus lexical segmentation: according to the result of step 2), the K recognition results of each single-character word are substituted into the text in turn, and grammatical segmentation is performed on each and saved;
Step 5) grammar matching based on a Chinese grammar library: for each single-character word produced by segmentation, the K different grammatical segmentation results are screened using prior grammatical knowledge, and the remaining result with the highest probability value from step 1) is selected for correction;
Step 6) after the single-character words and the multi-character words have been processed separately, the recognition result is output.
In a preferred scheme of the method, in step 1) the input text is denoted $X = (x_1, x_2, \ldots, x_n)$ and the output of the single-character recognizer is denoted $Y = (y_1, y_2, \ldots, y_n)$, where $y_i$ is the recognition result for character $x_i$ of the input text $X$. Each $y_i = \{(y_{i,1}, p_{i,1}), (y_{i,2}, p_{i,2}), \ldots, (y_{i,K}, p_{i,K})\}$ contains the K results with the largest probability values in the recognition classification output, and each tuple holds a recognition result and its probability.
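The top-K output described above can be pictured as a list of (character, probability) pairs per input position. The following is a minimal sketch under stated assumptions: the names `top_k` and `preliminary_sequence` and the score tables are hypothetical illustrations, not part of the patent.

```python
from typing import Dict, List, Tuple

Candidate = Tuple[str, float]   # (recognized character, probability)
TopK = List[Candidate]          # y_i: K candidates, most probable first


def top_k(scores: Dict[str, float], k: int) -> TopK:
    """Keep the k most probable characters from one classifier score table."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


def preliminary_sequence(y: List[TopK]) -> str:
    """Join the top-1 character of every position (the input to step 2)."""
    return "".join(cands[0][0] for cands in y)


# Hypothetical classifier scores for a two-character text.
scores_per_char = [
    {"中": 0.90, "串": 0.05, "申": 0.03, "甲": 0.02},
    {"国": 0.60, "圆": 0.25, "围": 0.10, "固": 0.05},
]
Y = [top_k(s, k=3) for s in scores_per_char]
print(Y[0])
print(preliminary_sequence(Y))
```

Keeping the probabilities next to each candidate is what lets the later steps weigh recognizer confidence against corpus and grammar evidence.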
In a preferred scheme of the method, the word segmentation tool in step 2) is jieba. The jieba algorithm performs efficient word-graph scanning based on a prefix dictionary to generate a directed acyclic graph (DAG) of all possible word formations of the Chinese characters in a sentence, uses dynamic programming to search for the maximum-probability path and thereby find the best segmentation based on word frequency, and handles out-of-vocabulary words with an HMM based on the word-forming capability of Chinese characters, decoded with the Viterbi algorithm. The Viterbi algorithm follows back pointers to recover the most likely sequence of hidden states, which effectively isolates noise in the sequence.
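The DAG and dynamic-programming idea described above can be illustrated with a small self-contained sketch. This is not jieba's code: the dictionary `FREQ` is a hypothetical toy, unknown characters simply fall back to single-character words, and the HMM/Viterbi handling of out-of-vocabulary words is omitted.

```python
import math

# Hypothetical toy word frequencies; a real segmenter ships a far larger dictionary.
FREQ = {"我": 50, "爱": 30, "中": 20, "国": 20, "中国": 40}
TOTAL = sum(FREQ.values())


def build_dag(sentence):
    """Map each start index to the end indices of dictionary words starting there."""
    dag = {}
    for i in range(len(sentence)):
        ends = [j for j in range(i, len(sentence)) if sentence[i:j + 1] in FREQ]
        dag[i] = ends or [i]  # unknown character: treat it as a one-character word
    return dag


def segment(sentence):
    """Right-to-left dynamic programming for the maximum log-probability path."""
    n = len(sentence)
    dag = build_dag(sentence)
    route = {n: (0.0, 0)}  # route[i] = (best log-probability from i, end index of first word)
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j + 1], 1) / TOTAL) + route[j + 1][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words


print(segment("我爱中国"))
```

With these toy frequencies the path through the two-character word scores higher than the path through its two separate characters, so the two characters stay together as one word.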
In a preferred scheme of the method, the specific flow of the forward maximum word matching based on the Chinese dictionary in step 3) is as follows:
(a) Assume the multi-character word contains n characters and each character has K recognition results. Combining the candidates in every possible way yields $K^n$ multi-character combinations. The probabilities of the characters in each combination are multiplied and normalized to give the combination recognition probabilities $P_1 = \{p'_1, p'_2, \ldots, p'_{K^n}\}$. Each combination is then looked up in the corpus and its number of occurrences is counted, denoted $\{c_1, c_2, \ldots, c_{K^n}\}$; normalizing these counts gives the occurrence probabilities of the multi-character word, $P_2 = \{c_1/\sum_i c_i,\ c_2/\sum_i c_i,\ \ldots,\ c_{K^n}/\sum_i c_i\}$.
(b) The combination recognition probability is combined with the occurrence probability of the multi-character word using a weight factor $\alpha$, so that the final combined probability is $P = \alpha P_1 + (1-\alpha) P_2$. The combination with the highest probability value is taken as the final result. This step weighs both sources of evidence together, which effectively reduces errors.
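Steps (a) and (b) can be sketched as follows. This is a minimal illustration, assuming a hypothetical function name `fuse_word_candidates`, a hypothetical corpus count table, and made-up candidate probabilities; the fallback of an all-zero $P_2$ when no combination occurs in the corpus is also an assumption, since the source does not say what happens in that case.

```python
import math
from itertools import product


def fuse_word_candidates(top_k_per_char, corpus_counts, alpha=0.5):
    """Score all K^n character combinations with P = alpha*P1 + (1-alpha)*P2."""
    combos = list(product(*top_k_per_char))
    words = ["".join(ch for ch, _ in combo) for combo in combos]

    # P1: product of per-character recognition probabilities, normalized.
    raw = [math.prod(p for _, p in combo) for combo in combos]
    p1 = [r / sum(raw) for r in raw]

    # P2: corpus occurrence counts, normalized (all zeros if nothing matches).
    counts = [corpus_counts.get(w, 0) for w in words]
    total = sum(counts)
    p2 = [c / total if total else 0.0 for c in counts]

    fused = [alpha * a + (1 - alpha) * b for a, b in zip(p1, p2)]
    best = max(range(len(words)), key=fused.__getitem__)
    return words[best], dict(zip(words, fused))


# Hypothetical top-2 candidates for a two-character word and a toy corpus.
candidates = [[("中", 0.9), ("串", 0.05)], [("固", 0.5), ("国", 0.3)]]
corpus = {"中国": 9}
best, scores = fuse_word_candidates(candidates, corpus, alpha=0.5)
print(best)
```

In this toy input the recognizer alone prefers a different second character, and the corpus count is what tips the fused score toward the word that actually occurs, which is the effect the weight factor $\alpha$ is meant to balance.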
In a preferred scheme of the method, the specific flow of the grammar matching based on the Chinese grammar library in step 5) is as follows:
The method uses the Modern Chinese Grammar Information Dictionary of Peking University as its corpus. Taking the word to be recognized as the center, the character under test is identified by checking its grammatical match with the context. Grammar matching mainly uses the grammatical and semantic knowledge provided by the word libraries in the corpus. Taking the noun library as an example, the items examined during grammar matching include the following: several, individual, metric, container, collective, category, shaping, indefinite, move-time, front, rear, front generation, front-to-back. Single-character recognition results that do not satisfy the grammar matching rules are deleted, and the remaining recognition result with the highest probability is used to update the character.
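The delete-then-pick-best logic described above can be sketched with a toy rule table. Everything here is a hypothetical stand-in: the part-of-speech lexicon `POS`, the allowed context patterns `ALLOWED`, and the fallback to the unfiltered candidates when nothing passes are assumptions for illustration, not the rules of the grammar dictionary named in the text.

```python
# Hypothetical part-of-speech lexicon and allowed (left, center, right) patterns.
POS = {"我": "pronoun", "爱": "verb", "艾": "noun", "中国": "noun"}
ALLOWED = {("pronoun", "verb", "noun")}


def grammar_filter(left, candidates, right):
    """Drop candidates whose pattern with the context is not allowed,
    then return the most probable survivor."""
    def passes(ch):
        return (POS.get(left), POS.get(ch), POS.get(right)) in ALLOWED

    survivors = [(ch, p) for ch, p in candidates if passes(ch)]
    # Assumption: if every candidate is rejected, keep the original ranking.
    pool = survivors or candidates
    return max(pool, key=lambda cp: cp[1])[0]


print(grammar_filter("我", [("艾", 0.5), ("爱", 0.3)], "中国"))
```

The point of the sketch is the ordering of operations: grammar acts as a hard filter first, and recognizer probability only ranks the candidates that remain.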
The invention makes full use of the syntactic information of the text, processes single-character and multi-character words separately, and adapts well; compared with conventional techniques based on word matching alone, it offers clearer advantages and higher application value, especially on lower-quality text. Compared with the prior art, the invention has the following advantages:
(1) The classification probabilities of single-character recognition are combined with the post-processing matching probabilities, which improves how well information is linked and used between the earlier and later stages of the structure. Conventional post-processing methods do not use the probability values of the recognition classification module, so information utilization is much lower and the result depends entirely on the accuracy of the post-processing method itself. The invention fuses the probability values of the single-character recognition module with those of the post-processing module, which effectively reduces the false detection rate of the post-processing module and improves the adaptability of the overall structure.
(2) The accuracy and completeness of the post-processing scheme are improved. Conventional methods perform word matching analysis on the segmented words and ignore single-character words and grammatical analysis. Based on word matching and grammar matching, the invention applies semantic and grammatical optimization to multi-character words and single-character words respectively, so that every word in the text is covered by the post-processing module and the accuracy and performance of post-processing are improved.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of lexical segmentation.
Detailed Description
To aid understanding of the invention, this embodiment is described in detail below with reference to the accompanying drawings.
Example: the OCR post-processing method based on word matching and grammar matching obtains the top K recognition results for each character through single-character recognition. For each text segment, the highest-probability recognition result of each character is taken as a preliminary sequence and segmented into words. Multi-character words produced by segmentation undergo a corpus-based word matching operation, and the word combination with the highest probability, taking into account the probabilities from the front-end recognition module, is selected to update the word. Single-character words produced by segmentation undergo a grammar matching operation: the K recognition results of the character are each checked by grammatical analysis, and the most probable result is used to update the character.
System structure:
Fig. 1 shows the architecture of the OCR post-processing method based on word matching and grammar matching, and a detailed description of two main parts is given below.
1. Forward maximum word matching based on Chinese dictionary:
(a) Assume the multi-character word contains n characters and each character has K recognition results. Combining the candidates in every possible way yields $K^n$ multi-character combinations. The probabilities of the characters in each combination are multiplied and normalized to give the combination recognition probabilities $P_1 = \{p'_1, p'_2, \ldots, p'_{K^n}\}$. Each combination is then looked up in the corpus and its number of occurrences is counted, denoted $\{c_1, c_2, \ldots, c_{K^n}\}$; normalizing these counts gives the occurrence probabilities of the multi-character word, $P_2 = \{c_1/\sum_i c_i,\ c_2/\sum_i c_i,\ \ldots,\ c_{K^n}/\sum_i c_i\}$.
(b) The combination recognition probability is combined with the occurrence probability of the multi-character word using a weight factor $\alpha$, so that the final combined probability is $P = \alpha P_1 + (1-\alpha) P_2$. The combination with the highest probability value is taken as the final result.
2. Grammar matching based on Chinese grammar library:
The method uses the Modern Chinese Grammar Information Dictionary of Peking University as its corpus. Taking the word to be recognized as the center, the character under test is identified by checking its grammatical match with the context. Grammar matching mainly uses the grammatical and semantic knowledge provided by the word libraries in the corpus. Taking the noun library as an example, the items examined during grammar matching include the following: several, individual, metric, container, collective, category, shaping, indefinite, move-time, front, rear, front generation, front-to-back. Single-character recognition results that do not satisfy the grammar matching rules are deleted, and the remaining recognition result with the highest probability is used to update the character.
2. The specific process comprises the following steps:
Referring to FIG. 1, the OCR post-processing method based on word matching and grammar matching includes the following steps:
Step 1) front-end OCR single-character recognition: the front-end OCR module locates the text in the scene and recognizes each character, and the K most likely recognition results and their probabilities are stored. Let the input text be $X = (x_1, x_2, \ldots, x_n)$ and the output of the single-character recognizer be $Y = (y_1, y_2, \ldots, y_n)$, where $y_i$ is the recognition result for character $x_i$ of the input text $X$. Each $y_i = \{(y_{i,1}, p_{i,1}), (y_{i,2}, p_{i,2}), \ldots, (y_{i,K}, p_{i,K})\}$ contains the K results with the largest probability values in the recognition classification output, and each tuple holds a recognition result and its probability.
Step 2) word segmentation: the highest-probability recognition result of each character is taken as the initial result, and a mainstream word segmentation tool segments the text sequence.
Step 3) forward maximum word matching based on a Chinese dictionary: for each multi-character word produced by segmentation, the recognized word is compared with its plausible similar candidate words, the most logical word is determined from the preceding and following recognized words, and the initial result is corrected.
(a) Assume the multi-character word contains n characters and each character has K recognition results. Combining the candidates in every possible way yields $K^n$ multi-character combinations. The probabilities of the characters in each combination are multiplied and normalized to give the combination recognition probabilities $P_1 = \{p'_1, p'_2, \ldots, p'_{K^n}\}$. Each combination is then looked up in the corpus and its number of occurrences is counted, denoted $\{c_1, c_2, \ldots, c_{K^n}\}$; normalizing these counts gives the occurrence probabilities of the multi-character word, $P_2 = \{c_1/\sum_i c_i,\ c_2/\sum_i c_i,\ \ldots,\ c_{K^n}/\sum_i c_i\}$;
(b) The combination recognition probability is combined with the occurrence probability of the multi-character word using a weight factor $\alpha$, so that the final combined probability is $P = \alpha P_1 + (1-\alpha) P_2$. The combination with the highest probability value is taken as the final result.
Step 4) multi-corpus lexical segmentation: according to the result of step 2), the K recognition results of each single-character word are substituted into the text in turn, and grammatical segmentation is performed on each and saved. According to the word segmentation result, the multi-character words to be segmented lexically are the updated results of step 3), while each single-character word takes the K recognition results obtained in step 1), and grammatical segmentation is performed for each;
Step 5) grammar matching based on a Chinese grammar library: for each single-character word produced by segmentation, the K different grammatical segmentation results are screened using prior grammatical knowledge, and the remaining result with the highest probability value from step 1) is selected for correction. Taking the word to be recognized as the center, the character under test is identified by checking its grammatical match with the context. Grammar matching mainly uses the grammatical and semantic knowledge provided by the word libraries in the corpus. Single-character recognition results that do not satisfy the grammar matching rules are deleted, and the remaining recognition result with the highest probability is used to update the character.
Step 6) after the single-character words and the multi-character words have been processed separately, the recognition result is output.
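Step 4) above, in which each of the K candidates of a single-character word is substituted into the segmented text and analyzed separately, can be sketched as follows. The function name `substitute_and_tag`, the toy lexicon `POS`, and the sample candidates are hypothetical; a real system would call an actual lexical analyzer instead of a dictionary lookup.

```python
# Hypothetical part-of-speech lexicon standing in for a real lexical analyzer.
POS = {"我": "pronoun", "爱": "verb", "艾": "noun", "中国": "noun"}


def substitute_and_tag(words, index, candidates):
    """Build one tagged variant of the word list per candidate at `index`."""
    variants = []
    for ch, prob in candidates:
        variant = words[:index] + [ch] + words[index + 1:]
        tagged = [(w, POS.get(w, "unknown")) for w in variant]
        variants.append((ch, prob, tagged))
    return variants


# Segmented text with a single-character word at position 1 and its K = 3 candidates.
variants = substitute_and_tag(
    ["我", "艾", "中国"], 1, [("艾", 0.5), ("爱", 0.3), ("尔", 0.1)]
)
for ch, prob, tagged in variants:
    print(ch, prob, tagged)
```

Saving all K tagged variants, rather than only the top one, is what gives step 5) something to screen.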
3. Specific application examples:
For convenience of description, assume the following simplified application example. The text to be detected is
$X = (x_1, x_2, x_3, x_4)$
Following the calculation steps described above:
First, the front-end OCR single-character recognition module stores the top K = 3 most likely recognition results and their probabilities. The corresponding result is $Y = (y_1, y_2, y_3, y_4)$, where $y_1$ = { ("me", 0.7), ("Russian", 0.2), ("zedoary", 0.08) }, $y_2$ = { ("ai", 0.5), ("ai", 0.3), ("ir", 0.1) }, $y_3$ = { ("mid", 0.9), ("string", 0.02), ("-", 0.01) }, $y_4$ = { ("country", 0.3), ("threshold", 0.2), ("basket", 0.2) }.
Second, word segmentation is performed. The highest-probability recognition results from the first step are taken as the initial result, and a mainstream word segmentation tool segments the text; the segmentation result is I/E/Chinese.
Third, forward maximum word matching based on the Chinese dictionary is performed.
(a) The segmentation result contains only one two-character word, and each of its characters has 3 recognition results, so 9 multi-character combinations ('Chinese', 'middle threshold', 'middle frame', 'string country', 'string threshold', 'string frame', 'furo' and 'furframe') are obtained after combination. The probabilities of the characters in each combination are multiplied and normalized to give the combination recognition probabilities P1 = (0.35, 0.23, 0.23, 0.08, 0.05, 0.05, 0.003, 0.003, 0.003). Each combination is then looked up in the corpus, and the numbers of occurrences are counted as (9, 5, 0, 3, 0, 0, 0, 0, 0); normalizing gives the occurrence probabilities of the multi-character word, P2 = (0.53, 0.29, 0, 0.18, 0, 0, 0, 0, 0).
(b) The combination recognition probability and the occurrence probability of the multi-character word are combined with a weight factor α = 0.5, so that the final combined probability is P = 0.5 × P1 + 0.5 × P2 = (0.44, 0.26, 0.12, 0.13, 0.025, 0.025, 0.0015, 0.0015, 0.0015). The combination "Chinese", which has the highest probability value, is taken as the final result.
Fourth, multi-corpus lexical segmentation is performed, as shown in FIG. 2. According to the word segmentation result, the multi-character word to be segmented lexically is the updated result of step 3), and each single-character word takes the 3 recognition results obtained in step 1); grammatical segmentation is performed for each.
Fifth, grammar matching based on the Chinese grammar library is performed: taking the word to be recognized as the center, the character under test is identified by checking its grammatical match with the context. Grammar matching mainly uses the grammatical and semantic knowledge provided by the word libraries in the corpus. Single-character recognition results that do not satisfy the grammar matching rules are deleted, and the remaining recognition result with the highest probability is used to update the character. Here the candidate "fres" for $x_1$ and the candidate "moxa" for $x_2$ are eliminated because they fail the grammar matching check. The remaining recognition result with the highest probability is then selected for each character: $x_1$ is recognized as "me" and $x_2$ is updated to "love".
Sixth, the post-processed result is output: the original "I/E/Chinese" is updated to "I love Chinese".
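The arithmetic of the third step in the example can be checked with a few lines of Python. The sketch below takes the stated P1 values and corpus counts as given and only recomputes P2 and the fused P; the variable names are illustrative.

```python
# Values as stated in the third step of the example.
p1 = [0.35, 0.23, 0.23, 0.08, 0.05, 0.05, 0.003, 0.003, 0.003]
counts = [9, 5, 0, 3, 0, 0, 0, 0, 0]
alpha = 0.5

total = sum(counts)
p2 = [c / total for c in counts]                       # normalized corpus counts
p = [alpha * a + (1 - alpha) * b for a, b in zip(p1, p2)]

best = max(range(len(p)), key=p.__getitem__)           # index of the winning combination
print([round(v, 2) for v in p2])
print([round(v, 4) for v in p])
print(best)
```

The recomputed values agree with the figures in the example to two decimal places, except that the third entry of P evaluates to 0.115, which the text reports as 0.12. The first combination remains the clear winner.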
It should be noted that the above-mentioned embodiments are not intended to limit the scope of the present invention, and equivalent changes or substitutions made on the basis of the above-mentioned technical solutions fall within the scope of the present invention as defined in the claims.

Claims (6)

1. An OCR post-processing method based on word matching and grammar matching, the method comprising the steps of:
Step 1) front-end OCR single-character recognition: the front-end OCR module locates the text in the scene and recognizes each character, and the K most likely recognition results and their probabilities are stored;
Step 2) word segmentation: the highest-probability recognition result of each character is taken as the initial result, and a mainstream word segmentation tool segments the text sequence;
Step 3) forward maximum word matching based on a Chinese dictionary: for each multi-character word produced by segmentation, the recognized word is compared with its plausible similar candidate words, the most logical word is determined from the preceding and following recognized words, and the initial result is corrected;
Step 4) multi-corpus lexical segmentation: according to the result of step 2), the K recognition results of each single-character word are substituted into the text in turn, and grammatical segmentation is performed on each and saved;
Step 5) grammar matching based on a Chinese grammar library: for each single-character word produced by segmentation, the K different grammatical segmentation results are screened using prior grammatical knowledge, and the remaining result with the highest probability value from step 1) is selected for correction;
Step 6) after the single-character words and the multi-character words have been processed separately, the recognition result is output.
2. The OCR post-processing method based on word matching and grammar matching according to claim 1, wherein in step 1) the front-end OCR single-character recognition module operates as follows:
Let the input text be $X = (x_1, x_2, \ldots, x_n)$ and the output of the single-character recognizer be $Y = (y_1, y_2, \ldots, y_n)$, where $y_i$ is the recognition result for character $x_i$ of the input text $X$, and $y_i = \{(y_{i,1}, p_{i,1}), (y_{i,2}, p_{i,2}), \ldots, (y_{i,K}, p_{i,K})\}$ contains the K results with the largest probability values in the recognition classification output, each tuple holding a recognition result and its probability.
3. The OCR post-processing method based on word matching and grammar matching according to claim 1, wherein the word segmentation tool in step 2) is jieba; the jieba algorithm performs efficient word-graph scanning based on a prefix dictionary to generate a directed acyclic graph (DAG) of all possible word formations of the Chinese characters in a sentence, then uses dynamic programming to find the best segmentation based on word frequency, and for out-of-vocabulary words uses an HMM based on the word-forming capability of Chinese characters, decoded with the Viterbi algorithm.
4. The OCR post-processing method based on word matching and grammar matching according to claim 1, wherein in step 3) the specific flow of the forward maximum word matching based on the Chinese dictionary is:
(a) Assume the multi-character word contains n characters and each character has K recognition results; combining the candidates in every possible way yields $K^n$ multi-character combinations; the probabilities of the characters in each combination are multiplied and normalized to give the combination recognition probabilities $P_1 = \{p'_1, p'_2, \ldots, p'_{K^n}\}$; each combination is then looked up in the corpus and its number of occurrences is counted, denoted $\{c_1, c_2, \ldots, c_{K^n}\}$; normalizing these counts gives the occurrence probabilities of the multi-character word, $P_2 = \{c_1/\sum_i c_i,\ c_2/\sum_i c_i,\ \ldots,\ c_{K^n}/\sum_i c_i\}$;
(b) The combination recognition probability is combined with the occurrence probability of the multi-character word using a weight factor $\alpha$, so that the final combined probability is $P = \alpha P_1 + (1-\alpha) P_2$, and the combination with the highest probability value is taken as the final result.
5. The OCR post-processing method based on word matching and grammar matching according to claim 1, wherein in step 4) the multi-corpus lexical segmentation is performed as follows:
According to the word segmentation result, the multi-character words to be segmented lexically are the updated results of step 3), and each single-character word takes the K recognition results obtained in step 1); lexical segmentation is performed for each.
6. The OCR post-processing method based on word matching and grammar matching according to claim 1, wherein in step 5) the specific flow of the grammar matching based on the Chinese grammar library is:
Taking the word to be recognized as the center, the character under test is identified by checking its grammatical match with the context; grammar matching mainly uses the grammatical and semantic knowledge provided by the word libraries in the corpus; single-character recognition results that do not satisfy the grammar matching rules are deleted, and the remaining recognition result with the highest probability is used to update the character.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110567957.4A CN113269192B (en) 2021-05-24 2021-05-24 OCR post-processing method based on word matching and grammar matching

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110567957.4A CN113269192B (en) 2021-05-24 2021-05-24 OCR post-processing method based on word matching and grammar matching

Publications (2)

Publication Number Publication Date
CN113269192A (en) 2021-08-17
CN113269192B (en) 2024-04-30

Family

ID=77232591

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110567957.4A Active CN113269192B (en) 2021-05-24 2021-05-24 OCR post-processing method based on word matching and grammar matching

Country Status (1)

Country Link
CN (1) CN113269192B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114694152B (en) * 2022-04-01 2023-03-24 江苏行声远科技有限公司 Printed text credibility fusion method and device based on three-source OCR (optical character recognition) result

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109034147A (en) * 2018-09-11 2018-12-18 上海唯识律简信息科技有限公司 Optical character identification optimization method and system based on deep learning and natural language
CN109582972A (en) * 2018-12-27 2019-04-05 信雅达系统工程股份有限公司 A kind of optical character identification error correction method based on natural language recognition
CN110751234A (en) * 2019-10-09 2020-02-04 科大讯飞股份有限公司 OCR recognition error correction method, device and equipment
CN111046627A (en) * 2018-10-12 2020-04-21 北京金山办公软件股份有限公司 Chinese character display method and system
CN111832299A (en) * 2020-07-17 2020-10-27 成都信息工程大学 Chinese word segmentation system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
NLP-based OCR post-processing method; Li Wenhua et al.; Software Guide (软件导刊); 2010-10-15; pp. 35-36 *

Also Published As

Publication number Publication date
CN113269192A (en) 2021-08-17

Similar Documents

Publication Publication Date Title
Kissos et al. OCR error correction using character correction and feature-based word classification
KR102417045B1 (en) Method and system for robust tagging of named entities
CN112906392B (en) Text enhancement method, text classification method and related device
CN105279149A (en) Chinese text automatic correction method
CN110853625B (en) Speech recognition model word segmentation training method and system, mobile terminal and storage medium
CN109983473B (en) Flexible integrated recognition and semantic processing
CN114818668B (en) Name correction method and device for voice transcription text and computer equipment
KR20140056753A (en) Apparatus and method for syntactic parsing based on syntactic preprocessing
Jemni et al. Out of vocabulary word detection and recovery in Arabic handwritten text recognition
CN113948066A (en) Error correction method, system, storage medium and device for real-time translation text
US8504359B2 (en) Method and apparatus for speech recognition using domain ontology
CN116502628A (en) Multi-stage fusion text error correction method for government affair field based on knowledge graph
CN113221542A (en) Chinese text automatic proofreading method based on multi-granularity fusion and Bert screening
Hládek et al. Learning string distance with smoothing for OCR spelling correction
CN113269192B (en) OCR post-processing method based on word matching and grammar matching
CN116127015A (en) NLP large model analysis system based on artificial intelligence self-adaption
Fusayasu et al. Word-error correction of continuous speech recognition based on normalized relevance distance
CN113420766B (en) Low-resource language OCR method fusing language information
CN113761903A (en) Text screening method for high-volume high-noise spoken short text
Kinaci Spelling correction using recurrent neural networks and character level n-gram
CN112447172A (en) Method and device for improving quality of voice recognition text
CN112711944B (en) Word segmentation method and system, and word segmentation device generation method and system
CN113239683A (en) Method, system and medium for correcting Chinese text errors
CN113609849A (en) Mongolian multi-mode fine-grained emotion analysis method fused with priori knowledge model
Pal et al. Vartani Spellcheck--Automatic Context-Sensitive Spelling Correction of OCR-generated Hindi Text Using BERT and Levenshtein Distance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant