CN114494679A

CN114494679A - Double-layer PDF generation and correction method and device

Info

Publication number: CN114494679A
Application number: CN202111504588.0A
Authority: CN
Inventors: 王东云; 李丽芬; 孙凡; 丁毅
Original assignee: SHANGHAI PRECISION METROLOGY AND TEST RESEARCH INSTITUTE
Current assignee: SHANGHAI PRECISION METROLOGY AND TEST RESEARCH INSTITUTE
Priority date: 2021-12-10
Filing date: 2021-12-10
Publication date: 2022-05-13

Abstract

The invention provides a double-layer PDF generation and correction method and device that perform secondary processing on the recognition result of an OCR engine: the recognition result is arranged into logically complete paragraphs; taking the paragraph as the unit, error detection and error correction are performed on the sentences of each paragraph at both character granularity and word granularity; and a double-layer PDF document is finally generated from the error correction result. The device mainly comprises an OCR recognition engine, a storage module, a paragraph synthesis module, an error detection and correction module, and an output module.

Description

Double-layer PDF generation and correction method and device

Technical Field

The invention relates to the technical field of computer information processing, and in particular to technology for generating, producing and proofreading double-layer PDF files.

Background

A double-layer PDF (Portable Document Format) file is a PDF file with a multilayer structure: its content includes both a text layer and an image layer, and the positions of the two layers correspond one to one. It is produced from a scanned PDF: the scanned image is cleaned, deskewed and passed through OCR (optical character recognition, in which software recognizes the scanned image as characters); the characters obtained by OCR are then made into a transparent text layer that is laid over the original scanned image layer. The result is called a "double-layer PDF". Such a PDF is also referred to as a "searchable scanned PDF" because, unlike a purely scanned PDF, its text can be searched, copied and exported. An index database can therefore be built on the text for scientific management.

However, OCR recognition results have a certain error rate. When the OCR result is poor, a word that is visible on the scanned PDF often cannot be found by searching, or the content copied and pasted from the double-layer PDF contains many wrong characters. Both problems are caused by inaccurate OCR recognition results.

Disclosure of Invention

The invention aims to provide a double-layer PDF generation and correction method and device that improve the accuracy of the text in a double-layer PDF.

In order to achieve the above object, the present invention provides a double-layer PDF generation and correction method that performs secondary processing on the recognition result of an OCR engine: the recognition result is arranged into logically complete paragraphs; taking the paragraph as the unit, error detection and error correction are performed on the sentences of each paragraph at both character granularity and word granularity; and a double-layer PDF document is then generated from the error correction result.

The double-layer PDF generation and correction method comprises the following steps:

1) the OCR recognition engine module recognizes the picture and outputs a recognition result;

2) the recognition result is filtered to obtain recognition result meta information; the recognition result meta information comprises the text block content, the circumscribed rectangular coordinates and the score of every text block;

3) text blocks are identified and text paragraphs are synthesized: the text blocks are combined into text paragraphs according to their circumscribed rectangular coordinates, generating new text paragraphs and the circumscribed rectangular coordinates of each paragraph;

4) error detection is performed on the text paragraphs:

4.1) error detection: a Chinese word segmenter segments the text, errors are detected at both character granularity and word granularity, and the suspected errors found at the two granularities are merged into a candidate set of suspected error positions;

4.2) the error detection result is persisted to the database, providing error position information for subsequent manual correction;

4.3) the error detection result is read and all suspected error positions are traversed; the words at each error position are replaced using dictionaries of similar-sounding and similar-shaped characters, sentence perplexity is then calculated with a language model, and the replacement results for all suspected error positions in the candidate set are compared and ranked to obtain the optimal corrected word;

4.4) the error correction result and the ranking information are persisted to the database, providing error correction suggestions for subsequent manual correction;

5) error correction, comprising automatic error correction and/or manual error correction;

6) a double-layer PDF file is generated according to the error correction result.

In the above double-layer PDF generation and correction method, step 2) comprises: 2.1) obtaining the document MD5 value: the document content is read from the path of the picture document being recognized, and the document MD5 value is calculated from that content; 2.2) the recognition result is a JSON array; it is filtered to obtain the text block content, the circumscribed rectangular coordinates and the score of every text block, and the recognition result meta information is stored in the database with the document MD5 value obtained in step 2.1) as the primary key. A coordinate system is established with the upper left corner of the circumscribed rectangle of a certain text block as the origin (0, 0), the line extending rightwards as the X axis and the line extending downwards as the Y axis; after filtering, the circumscribed rectangle of each text block is expressed by the coordinates of its upper left and lower right corners in this coordinate system.

In the above double-layer PDF generation and correction method, step 3) comprises: 3.1) finding the X value of the upper left corner of the circumscribed rectangle of the leftmost text block, recorded as X1; 3.2) finding the X value of the upper left corner of the circumscribed rectangle of the rightmost text block, recorded as X2; 3.3) traversing all recognition results, taking the X value of the upper left corner of the circumscribed rectangle of each text block, assembling the text blocks whose X values lie between X1 and X2 and whose Y values are the same into a line, and sorting all lines in ascending order of the Y value of the upper left corner of the circumscribed rectangle of their text blocks; 3.4) finding the lines that are paragraph heads; 3.5) assembling the lines into paragraphs according to the typesetting style of the document, on the principle that the first line of a paragraph is indented by two characters.

In the above double-layer PDF generation and correction method, in step 3.4), all lines are traversed and the difference between X1 and the X value of the upper left corner of the circumscribed rectangle of each line's starting text block is calculated; if the difference is nonzero, the line is marked as a paragraph head; otherwise, it is marked as an ordinary line.

In the above double-layer PDF generation and correction method, in step 3.5), all lines are traversed; if the current line is a paragraph head, the traversal continues until the next paragraph head is reached, the lines from the current paragraph head up to (but not including) the next one are assembled into a paragraph, and the next round of the loop then begins, until all lines have been processed.

In the above double-layer PDF generation and correction method, in step 5), if automatic error correction is enabled in the system, the system automatically corrects the detected errors.

In the above double-layer PDF generation and correction method, in step 5), if manual text error correction is enabled in the system, a proofreading page is displayed for manual error correction. The proofreading page is divided into two parts: one shows the original picture, and the other shows the document reconstructed from the circumscribed rectangular coordinates of the recognized text and of the detected erroneous text. Regions containing detected erroneous text are highlighted in a different color so that the user can find and check them immediately; when the user clicks an erroneous text region, the system offers error correction suggestions, and when the user double-clicks a suggestion, the erroneous text region is updated. Clicking an erroneous text region also highlights the corresponding coordinate region of the original picture, so that the user can conveniently compare the original picture with the recognition result in that region.

In the above double-layer PDF generation and correction method, step 6) comprises: 6.1) persisting to the database the content and circumscribed rectangular coordinates of all text paragraphs that the user has corrected; 6.2) reading the latest content and circumscribed rectangular coordinates of all text paragraphs of the current recognized document, and outputting a double-layer PDF file using the font size set in the system.

Another technical solution provided by the invention is a double-layer PDF generation and correction device comprising the following modules: an OCR recognition engine module for recognizing the picture and outputting a recognition result; a storage module for storing the recognition result meta information and the error detection result; a paragraph synthesis module for combining the text blocks into text paragraphs according to the content and circumscribed rectangular coordinates of all text blocks of the OCR-recognized document, and generating new text paragraphs and the circumscribed rectangular coordinates of each paragraph; an error detection and correction module for applying Chinese lexical analysis to the synthesized text paragraphs to detect those with lexical errors, storing the related text paragraph information, and further correcting the text paragraphs with lexical errors; and an output module for outputting the double-layer PDF file according to the original picture, the content of the recognized text paragraphs, the circumscribed rectangular coordinates of the recognized text paragraphs and the font size.

Compared with the prior art, the invention has the beneficial technical effects that:

In the OCR recognition process a small number of character recognition errors occur. Traditional proofreading can only be done by manually reading the original text and checking it bit by bit; the positions of errors in the document cannot be determined before proofreading, so manual checking and proofreading is inefficient. The advantages of the double-layer PDF generation and correction method and device are as follows: the text output by OCR undergoes secondary processing and Chinese lexical analysis, so that errors in the text are checked and identified automatically; manual proofreading and correction are supported, with the original text and the recognition result displayed in a visual comparison view so that recognition errors are clear at a glance; modification suggestions are given according to the error detection results, which greatly improves proofreading efficiency; and finally the double-layer PDF is output according to the proofreading result, so that the output file matches what is shown during proofreading ("what you see is what you get"), improving both proofreading efficiency and output precision.

Drawings

Fig. 1 is a flowchart of a double-layer PDF generation and correction method according to an embodiment of the present invention.

Detailed Description

The double-layer PDF generation and correction method and device of the present invention are described in further detail below with reference to Fig. 1.

The double-layer PDF generation and correction method of the invention performs secondary processing on the recognition result of an OCR engine: the recognition result is arranged into logically complete paragraphs; taking the paragraph as the unit, error detection and error correction are performed on the sentences of each paragraph at both character granularity and word granularity; and a double-layer PDF document is finally generated from the correction result.

Referring to Fig. 1, the double-layer PDF generation and correction method of the present embodiment includes the following steps:

step 1: the OCR recognition engine module recognizes the picture;

a picture to be recognized is input into the OCR recognition engine module, which calls trained Chinese-character, letter and digit models to perform detection, recognition and other processing, and outputs a recognition result;

step 2: storing the identification result meta information;

the recognition result is filtered and persisted to the database; this step is divided into two sub-steps:

step 2.1: obtaining the document MD5 value: the document content is read from the path of the picture document being recognized, and the document MD5 value is calculated from that content; this value can serve as the primary key under which all subsequent information is stored;

step 2.2: the recognition result is a JSON array; it is filtered to obtain the text block content, the circumscribed rectangular coordinates and the score of every text block; these items are called the recognition result meta information, which is stored in the database with the document MD5 value obtained in step 2.1 as the primary key;
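Step 2.1 can be illustrated with a minimal sketch using Python's standard `hashlib` module. The function name `document_md5` and the chunked reading are assumptions made for illustration; the source only states that the MD5 value is calculated from the document content.

```python
import hashlib
from pathlib import Path


def document_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of the file at `path`.

    `document_md5` and `chunk_size` are hypothetical names; the file is
    read in chunks so that large scans need not fit in memory at once.
    """
    digest = hashlib.md5()
    with Path(path).open("rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string could then be used as the key under which the later records for the same document are stored.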

taking the upper left corner of a certain text block circumscribed rectangle as an origin (0, 0), taking a rightward extension line as an X axis, taking a downward extension line as a Y axis, establishing a coordinate system, and expressing the coordinates of the text block circumscribed rectangle after filtering by using the coordinates of the upper left corner and the lower right corner of the rectangle under the coordinate system;
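The filtering in step 2.2 might look like the sketch below. The JSON field names `text`, `box` and `score`, and the assumption that `box` is a list of four corner points, are hypothetical: the source says only that the recognition result is a JSON array, so a real OCR engine's output would need its own mapping.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TextBlock:
    text: str
    x1: float  # upper-left X
    y1: float  # upper-left Y
    x2: float  # lower-right X
    y2: float  # lower-right Y
    score: float


def filter_recognition_result(raw_json: str) -> list[TextBlock]:
    """Keep only content, circumscribed rectangle and score of each block.

    Assumes each array item has "text", "box" (corner points as [x, y]
    pairs) and "score" keys; any other keys are dropped.
    """
    blocks = []
    for item in json.loads(raw_json):
        xs = [point[0] for point in item["box"]]
        ys = [point[1] for point in item["box"]]
        # With Y growing downwards, the upper-left corner is (min x, min y)
        # and the lower-right corner is (max x, max y).
        blocks.append(
            TextBlock(item["text"], min(xs), min(ys), max(xs), max(ys), item["score"])
        )
    return blocks
```

The resulting list is what would be stored under the document MD5 value as the recognition result meta information.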

step 3: identifying text blocks and synthesizing text paragraphs;

the text blocks are combined into text paragraphs according to their circumscribed rectangular coordinates, generating new text paragraphs and the circumscribed rectangular coordinates of each paragraph; the purpose of this step is to group the recognized text blocks into semantically related wholes in preparation for the subsequent error detection; specifically, it comprises the following steps:

step 3.1: finding the X value of the coordinate of the upper left corner of the circumscribed rectangle of the leftmost text block, and recording the X value as X1;

the recognition result array is arranged in an ascending order, the first 20 items in the recognition result are obtained (if the recognition result is less than 20 items, all the items are taken), the minimum value of X is found out by scanning the data set (the data set consisting of the first 20 items), and the value is the X value of the coordinate of the upper left corner of the circumscribed rectangle of the leftmost text block;

step 3.2: finding the X value of the coordinate of the upper left corner of the circumscribed rectangle of the rightmost text block, and recording the X value as X2;

the recognition result array is arranged in an ascending order, the last 20 items in the recognition result are obtained (if the recognition result is less than 20 items, all the items are taken), the maximum value of X is found out by scanning the data set, and the value is the X value of the coordinate of the upper left corner of the circumscribed rectangle of the rightmost text block;
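Steps 3.1 and 3.2 can be sketched as follows. The function takes the upper-left X values in the order of the already sorted recognition result array; the source does not say which key the array is sorted by, so the sketch simply samples the first and last items as described. The names `page_x_bounds` and `sample_size` are hypothetical.

```python
def page_x_bounds(x_values: list[float], sample_size: int = 20) -> tuple[float, float]:
    """Return (X1, X2) from the first and last `sample_size` X values.

    `x_values` holds the upper-left X of each text block, in the order of
    the sorted recognition result array. Slicing naturally takes all items
    when there are fewer than `sample_size`.
    """
    if not x_values:
        raise ValueError("no text blocks to scan")
    x1 = min(x_values[:sample_size])   # leftmost value among the first items
    x2 = max(x_values[-sample_size:])  # rightmost value among the last items
    return x1, x2
```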

step 3.3: assembling text blocks into lines;

traversing all recognition results, taking the X value of the upper left corner coordinate of the circumscribed rectangle of each text block, assembling the text blocks with the X values between the leftmost value X1 and the rightmost value X2 into a line under the condition that the Y values are the same, storing all the data of the same line into a line array, and sorting the line array in ascending order according to the Y values of the upper left corner coordinate of the circumscribed rectangle of the text blocks;
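A minimal sketch of step 3.3 is given below. It groups blocks by exactly equal Y values, as the text describes; real OCR output would probably need some tolerance on Y, which the source does not mention. Each block is represented as a hypothetical `(text, x, y)` tuple holding its content and upper-left corner.

```python
from collections import defaultdict


def assemble_lines(blocks, x1, x2):
    """Group (text, x, y) blocks into lines sorted by ascending Y.

    Blocks whose upper-left X lies outside [x1, x2] are skipped; blocks
    with the same Y are joined left to right into one line.
    """
    rows = defaultdict(list)
    for text, x, y in blocks:
        if x1 <= x <= x2:
            rows[y].append((x, text))
    lines = []
    for y in sorted(rows):
        parts = sorted(rows[y])  # order the blocks of a line by X
        lines.append({
            "y": y,
            "x": parts[0][0],  # X of the line's starting text block
            "text": "".join(text for _, text in parts),
        })
    return lines
```

The `x` stored for each line is the X value of its starting text block, which is what the paragraph-head test in step 3.4 compares against X1.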

step 3.4: finding lines where paragraph headers exist;

all lines are traversed and the difference between the X value of each line's starting text block and the leftmost value X1 obtained in step 3.1 is calculated; if the difference is nonzero, the line is marked as a paragraph head; otherwise, it is marked as an ordinary line;

step 3.5: assembling the lines into paragraphs according to the typesetting style of the document, on the principle that the first line of a paragraph is indented by two characters;

all lines are traversed; if the current line is a paragraph head, the traversal continues until the next paragraph head is reached, the lines from the current paragraph head up to (but not including) the next one are assembled into a paragraph, and the next round of the loop then begins, until all lines have been processed;

step 4: performing error detection on the text paragraphs;

error detection and correction are performed on the synthesized text paragraphs to obtain the detection results; the main purpose of this step is to further improve the accuracy of text recognition; if automatic text error correction is enabled in the system, the system automatically corrects the detected errors; the step comprises the following:
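Steps 3.4 and 3.5 can be sketched together. Lines are given as hypothetical `(x, text)` pairs already sorted by Y, where `x` is the X value of the line's starting text block. Treating any lines that precede the first paragraph head as a paragraph of their own is an assumption of this sketch, since the source does not cover that case.

```python
def assemble_paragraphs(lines, x1):
    """Join lines into paragraphs, starting a new one at each paragraph head.

    A line is a paragraph head when its starting X differs from X1
    (i.e. it is indented), as in step 3.4.
    """
    paragraphs = []
    current = []
    for x, text in lines:
        is_head = (x - x1) != 0
        if is_head and current:
            # The next paragraph head closes the paragraph collected so far.
            paragraphs.append("".join(current))
            current = []
        current.append(text)
    if current:
        paragraphs.append("".join(current))
    return paragraphs
```

The circumscribed rectangle of each paragraph, which the method also generates, is left out here for brevity; it could be taken as the union of the rectangles of the paragraph's lines.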

step 4.1: error detection;

the text is first segmented by a Chinese word segmenter; if a sentence in the paragraph contains wrongly written characters, the segmentation result will contain incorrect splits, so errors can be detected at both character granularity and word granularity, and the suspected errors found at the two granularities are merged into a candidate set of suspected error positions;

step 4.2: the error detection result is persisted to the database, providing error position information for subsequent manual correction;
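The idea of step 4.1 can be illustrated with a toy sketch. The source does not name the segmenter or the detection criteria, so everything here is an assumption: a simple forward-maximum-matching segmenter stands in for the Chinese word segmenter, a token outside the word dictionary counts as a word-granularity suspect, and a character outside a common-character set counts as a character-granularity suspect.

```python
def segment(sentence, vocab, max_len=4):
    """Forward maximum matching; returns (start, token) pairs.

    Stand-in for a real Chinese word segmenter. Unknown text falls back
    to single-character tokens.
    """
    out, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in vocab:
                out.append((i, piece))
                i += size
                break
    return out


def detect_suspects(sentence, vocab, common_chars):
    """Merge word-level and character-level suspects into one candidate set.

    Returns sorted (start, end) spans of suspected error positions.
    """
    word_level = {
        (start, start + len(token))
        for start, token in segment(sentence, vocab)
        if token not in vocab  # fragment left over by the segmenter
    }
    char_level = {
        (i, i + 1) for i, ch in enumerate(sentence) if ch not in common_chars
    }
    return sorted(word_level | char_level)
```

With the misrecognized sentence "我们今夭去学习" (夭 in place of 天), the word 今天 can no longer be matched, so the segmenter leaves the fragments 今 and 夭 behind and both positions enter the candidate set.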

step 4.3: error correction;

error correction reads the error detection result from the system, traverses all suspected error positions, replaces the words at each error position using similar-character dictionaries, calculates sentence perplexity with a language model, and compares and ranks the replacement results of all candidates to obtain the optimal corrected word;

step 4.4: the error correction result and the ranking information are persisted to the database, providing error correction suggestions for subsequent manual correction;

step 5: text proofreading;
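The ranking in step 4.3 can be sketched with a toy language model. The source does not specify the model, so a character bigram model with add-one smoothing is swapped in here purely for illustration; `BigramModel`, `rank_corrections` and the `confusion` mapping (a stand-in for the similar-character dictionaries) are hypothetical names. Perplexity is computed as the exponential of the negative mean log-probability per character, so a lower value means a more plausible sentence.

```python
import math
from collections import Counter


class BigramModel:
    """Character bigram model with add-one smoothing (illustrative only)."""

    def __init__(self, corpus):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in corpus:
            chars = ["<s>"] + list(sentence)
            self.unigrams.update(chars)
            self.bigrams.update(zip(chars, chars[1:]))
        self.vocab_size = len(self.unigrams)

    def perplexity(self, sentence):
        if not sentence:
            raise ValueError("sentence must be non-empty")
        chars = ["<s>"] + list(sentence)
        log_prob = 0.0
        for prev, cur in zip(chars, chars[1:]):
            # Add-one smoothing keeps unseen bigrams at a small nonzero probability.
            p = (self.bigrams[(prev, cur)] + 1) / (self.unigrams[prev] + self.vocab_size)
            log_prob += math.log(p)
        return math.exp(-log_prob / (len(chars) - 1))


def rank_corrections(sentence, position, confusion, model):
    """Rank the original character and its look-alikes by sentence perplexity.

    Returns candidate characters, best (lowest perplexity) first.
    """
    original = sentence[position]
    candidates = [original] + sorted(confusion.get(original, ()))
    scored = []
    for ch in candidates:
        variant = sentence[:position] + ch + sentence[position + 1:]
        scored.append((model.perplexity(variant), ch))
    return [ch for _, ch in sorted(scored)]
```

The first item of the returned list plays the role of the optimal corrected word, and the full ordering corresponds to the ranking information that step 4.4 persists as suggestions.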

if manual text error correction is enabled in the system, a proofreading page is displayed for manual error correction; the proofreading page is divided into two parts: one shows the original picture, and the other shows the document reconstructed from the circumscribed rectangular coordinates of the recognized text and of the detected erroneous text; regions containing detected erroneous text are highlighted in a different color so that the user can find and check them immediately; when the user clicks an erroneous text region, the system offers error correction suggestions, and when the user double-clicks a suggestion, the erroneous text region is updated; clicking an erroneous text region also highlights the corresponding coordinate region of the original picture, so that the user can conveniently compare the original picture with the recognition result in that region;

if automatic text error correction is enabled in the system, the system automatically corrects the detected errors and step 5 can be omitted; alternatively, automatic text error correction and manual proofreading can both be enabled;

step 6: generating a double-layer PDF file;

outputting a double-layer PDF file according to all the contents of the text paragraphs, circumscribed rectangular coordinates of the text paragraphs and a set font size after the text paragraphs are checked by a user; the method specifically comprises the following steps:

step 6.1: updating all the text paragraph contents and the circumscribed rectangular coordinates of the text paragraphs which are correctly modified by the user to a database for persistence;

step 6.2: reading the latest content and circumscribed rectangular coordinates of all text paragraphs of the current recognized document, and outputting a double-layer PDF file using the font size set in the system.
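One detail that step 6.2 implies but does not spell out is the coordinate conversion needed to place the text layer over the image layer. The circumscribed rectangles use a top-left origin with Y pointing downwards, whereas default PDF user space has its origin at the bottom-left with Y pointing upwards and a unit of 1/72 inch. The sketch below shows one way to convert; the function name and the scan resolution `dpi` are assumptions, since the source states neither.

```python
def pixel_rect_to_pdf(rect, image_height_px, dpi=300):
    """Convert (x1, y1, x2, y2) in image pixels to PDF points.

    Input uses a top-left origin with Y down; output is
    (left, bottom, right, top) with a bottom-left origin and Y up.
    `dpi` is the assumed resolution of the scanned picture.
    """
    x1, y1, x2, y2 = rect
    scale = 72.0 / dpi  # PDF points per pixel
    left = x1 * scale
    right = x2 * scale
    top = (image_height_px - y1) * scale     # flip the Y axis
    bottom = (image_height_px - y2) * scale
    return left, bottom, right, top
```

A PDF library would then draw the paragraph text inside the returned rectangle as the transparent text layer on top of the page image.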

The invention also discloses a double-layer PDF generation and correction device, which comprises the following modules:

the OCR recognition engine module is used for recognizing the picture, based on mature, commercially available technology, and outputting a recognition result;

the storage module is mainly used for storing document identification result meta information and error detection result information which are required to be applied in document display, proofreading and double-layer PDF generation stages;

a paragraph synthesis module, which combines the text in adjacent circumscribed rectangles of the text blocks of the OCR-recognized document into text paragraphs according to the positional relationship of their coordinates, and generates new text paragraphs and the circumscribed rectangular coordinates of each paragraph;

the error detection and correction module is used for applying Chinese lexical analysis to the synthesized text paragraphs to detect the text paragraphs with lexical errors, storing related text paragraph information and further correcting the text paragraphs;

and the output module is used for outputting the double-layer PDF file according to the original picture, the content of the identified text paragraph, the circumscribed rectangular coordinate of the identified text paragraph and the font size.

By this method, efficient OCR proofreading and high-precision output of double-layer PDF documents are achieved; the output double-layer PDF file matches what is shown during proofreading on the device ("what you see is what you get"), and both proofreading efficiency and output precision are improved.

It will be apparent to those skilled in the art that various changes and modifications may be made in the invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A double-layer PDF generation and correction method, characterized in that secondary processing is performed on the recognition result of an OCR engine: the recognition result is arranged into logically complete paragraphs; taking the paragraph as the unit, error detection and error correction are performed on the sentences of each paragraph at both character granularity and word granularity; and a double-layer PDF document is finally generated from the error correction result.

2. The dual-layer PDF generation and correction method of claim 1, comprising:

1) the OCR recognition engine module recognizes the picture and outputs a recognition result;

2) filtering the recognition result to obtain recognition result meta information;

the identification result meta information comprises text block contents of all text blocks, text block circumscribed rectangular coordinates and text block scores;

3) identifying text blocks and synthesizing text paragraphs;

combining the text blocks into a text paragraph according to the circumscribed rectangular coordinates of the text blocks to generate a new text paragraph and circumscribed rectangular coordinates of the paragraph;

4) carrying out error detection on the text paragraphs;

4.1) error detection: a Chinese word segmenter segments the text, errors are detected at both character granularity and word granularity, and the suspected errors found at the two granularities are merged into a candidate set of suspected error positions;

4.2) the error detection result is persisted to the database, providing error position information for subsequent manual correction;

4.3) the error detection result is read and all suspected error positions are traversed; the words at each error position are replaced using dictionaries of similar-sounding and similar-shaped characters, sentence perplexity is then calculated with a language model, and the replacement results for all suspected error positions in the candidate set are compared and ranked to obtain the optimal corrected word;

4.4) the error correction result and the ranking information are persisted to the database, providing error correction suggestions for subsequent manual correction;

5) correcting errors;

including automatic error correction and/or human intervention error correction;

6) and generating a double-layer PDF file according to the error correction result.

3. The dual-layer PDF generating and collating method according to claim 2, wherein said step 2) comprises:

2.1) obtaining document MD5 value: reading the document content according to the identification picture document path, and calculating the document MD5 value according to the document content;

2.2) the recognition result is a JSON array; it is filtered to obtain the text block content, the circumscribed rectangular coordinates and the score of every text block, and the recognition result meta information is stored in the database with the document MD5 value obtained in step 2.1) as the primary key;

a coordinate system is established with the upper left corner of the circumscribed rectangle of a certain text block as the origin (0, 0), the line extending rightwards as the X axis and the line extending downwards as the Y axis, and after filtering, the circumscribed rectangle of each text block is expressed by the coordinates of its upper left and lower right corners in this coordinate system.

4. The dual-layer PDF generating and collating method according to claim 3, wherein said step 3) comprises:

3.1) finding the X value of the coordinate of the upper left corner of the circumscribed rectangle of the leftmost text block, and recording as X1;

3.2) finding the X value of the coordinate of the upper left corner of the circumscribed rectangle of the rightmost text block, and recording as X2;

3.3) traversing all recognition results, taking the X value of the coordinate of the upper left corner of the circumscribed rectangle of each text block, assembling the text blocks with the X values between X1 and X2 into a line under the condition that the Y values are the same, and sequencing all the lines in ascending order according to the Y values of the coordinate of the upper left corner of the circumscribed rectangle of the text blocks;

3.4) finding the lines that are paragraph heads;

3.5) assembling the lines into paragraphs according to the typesetting style of the document, on the principle that the first line of a paragraph is indented by two characters.

5. The method as claimed in claim 4, wherein in step 3.4), all lines are traversed and the difference between X1 and the X value of the upper left corner of the circumscribed rectangle of each line's starting text block is calculated; if the difference is nonzero, the line is marked as a paragraph head; otherwise, it is marked as an ordinary line.

6. The method as claimed in claim 5, wherein in step 3.5), all lines are traversed; if the current line is a paragraph head, the traversal continues until the next paragraph head is reached, the lines from the current paragraph head up to (but not including) the next one are assembled into a paragraph, and the next round of the loop then begins, until all lines have been processed.

7. The dual-layer PDF generating and correcting method according to claim 1, wherein in step 5), if the system sets an automatic error correction program, the system automatically corrects the detected error.

8. The dual-layer PDF generation and proofreading method of claim 1, wherein in the step 5), if the system sets manual intervention text error correction, a proofreading page is displayed for manual error correction; the correction page is divided into two parts, one part is an original picture, the other part is a document recovered based on the recognized text circumscribed rectangular coordinate and the detected error text circumscribed rectangular coordinate, wherein the region containing the detected error text can be highlighted in different colors, so that a user can immediately find and check the error text region, when the user clicks the error text region, the system can give an error correction suggestion, and the user can update the error text region by double clicking the error correction suggestion; when the user clicks the error text area, the corresponding coordinate area of the original drawing is highlighted, so that the user can conveniently compare the original drawing with the identification result in the area.

9. The dual-layer PDF generating and collating method according to claim 1, wherein said step 6) comprises:

10. The double-layer PDF generation and correction device is characterized by comprising the following modules:

the OCR recognition engine module is used for recognizing the picture and outputting a recognition result;

the storage module is used for storing the identification result meta information and the error detection result;

the paragraph synthesis module is used for combining all text blocks into a text paragraph according to the content of all the text blocks of the OCR recognition document and the circumscribed rectangular coordinates of the text block, and generating a new text paragraph and the circumscribed rectangular coordinates of the paragraph;

the error detection and correction module is used for applying Chinese lexical analysis to the synthesized text paragraphs to detect the text paragraphs with lexical errors, storing related text paragraph information and further correcting the text paragraphs with the lexical errors;