CN113836971B

CN113836971B - Visual information reproduction method, system and storage medium after image type scanning piece identification

Info

Publication number: CN113836971B
Application number: CN202010580263.XA
Authority: CN
Inventors: 翟晓刚
Original assignee: China Life Insurance Asset Management Co ltd
Current assignee: China Life Insurance Asset Management Co ltd
Priority date: 2020-06-23
Filing date: 2020-06-23
Publication date: 2023-12-29
Anticipated expiration: 2040-06-23
Also published as: CN113836971A

Abstract

The invention relates to the field of document processing and discloses a visual information reproduction method, system and storage medium after image-type scanning piece recognition. A user loads an image-type scanning piece recognition content visual information recovery system and uploads the image-type scanned PDF whose visual information is to be reproduced. A comparison table is established between word width pixels and line spacing pixels and the font sizes and line spacing sizes of the font library in a Word document. Using a visual information analysis algorithm based on computer vision and an OCR text recognition technology based on deep learning, the system recognizes the text content in all line detection areas, counts the characters in each line detection area, calculates the average character width and the line spacing in pixels, and compares them with the comparison table to obtain the font size and line spacing. It then calculates the paragraph visual information, determines the paragraph heads, and finally outputs an editable Word document according to the font size, the line spacing and the paragraph visual information. The information uploaded by the user is kept confidential, the system is safe and easy to operate, and the PDF format conversion is carried out quickly.

Description

Visual information reproduction method, system and storage medium after image type scanning piece identification

Technical Field

The present invention relates to the field of document processing, and in particular, to a method, a system, and a storage medium for reproducing visual information after image-type scan piece recognition.

Background

PDF (Portable Document Format) is a common electronic file format. It is highly portable and compatible across operating systems, and it ensures that data is not altered by differences in encoding during file transfer, so PDF has become a mainstream format for exchanging documents. A PDF file protects its content against accidental modification, for example by an inadvertent keystroke, but for the same reason it is inconvenient to edit and difficult to convert to other file formats.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides a visual information reproduction method, a visual information reproduction system and a storage medium after image scanning piece identification.

In order to solve the technical problems, the invention provides the following technical scheme:

a visual information reproduction method after image type scanning piece identification comprises the following steps:

S1: establishing a comparison table between word width pixels and line spacing pixels and the fonts, font sizes and line spacing sizes of the font library in Word documents;

S2: uploading the image-type scanned PDF;

S3: cutting the image-type scanned PDF page by page into pictures and preprocessing the pictures;

S4: calculating the position information of the text line areas of all cut pictures of the image-type scanned PDF through a text line detection algorithm based on deep learning and computer vision technology, namely calculating the word width pixels of each character in the line areas and the start coordinate information and end coordinate information of all text line areas;

S5: recognizing the text content in all line detection areas by an OCR text recognition technology based on deep learning;

S6: counting the characters, including punctuation marks, in each line detection area;

S7: calculating the average width pixels of the characters from the line width, obtained from the start and end coordinate information of the text line area, and the number of characters in the line; comparing the average width pixels of the characters with the word width pixels calculated in step S4; and taking the smaller value as the final word width pixel value;

S8: looking up the word width pixel value obtained in step S7 in the comparison table to obtain the corresponding font and font size, and matching the font and font size with the position information of the line detection area;

S9: calculating line spacing pixels from the start and end coordinate information of the text line areas and looking them up in the comparison table to obtain the corresponding line spacing size, and at the same time calculating paragraph visual information to determine whether a line is a paragraph head;

S10: outputting an editable Word document according to the fonts, font sizes, line spacing and paragraph visual information.

Further, step S1 includes: establishing the correspondence between all word width pixels and line spacing pixels and the common fonts, font sizes and line spacing sizes in Word.

Further, step S2 includes: executing a localized encryption program when the image-type scanned PDF is uploaded.

Further, the preprocessing in step S3 includes seal removal, tilt correction, noise removal and the like.

Further, step S4 includes: the image-type scanned PDF is a long text; the pictures of the long-text image-type scanned PDF are analyzed page by page to detect and locate the text line areas, and the start coordinate information and end coordinate information of each line area are calculated by analysis.

Further, step S9 includes: according to the start coordinate information and end coordinate information of adjacent line areas, determining the line spacing pixels by calculating the line width difference between the start coordinate information of the high-order line and that of the low-order line, and determining the paragraph visual information by calculating the line width difference between the start coordinate information of the high-order line and that of the low-order line.

Further, step S9 determines the paragraph head as follows: if the calculated width difference between the starting coordinates of adjacent lines is about 2 times the width pixels of the characters calculated in step S7, the line is marked as a paragraph head, with two blank character spaces in front of the line.

The invention provides a visual information reproduction method after image-type scanning piece recognition, in which the comparison table of fonts, font sizes and line spacing is looked up by the word width pixels in steps S4 to S8 to obtain the corresponding font size, the paragraph head is calculated in step S9, and the blank space of 2 characters at the paragraph head is identified.

The invention provides a visual information reproduction system after image type scanning piece identification, which is loaded on a local CPU server and used by multiple users in parallel, wherein the visual information reproduction system after image type scanning piece identification is an image type scanning piece identification content visual information recovery system and comprises the following components:

a memory for storing executable instructions;

and the processor is used for realizing the visual information reproduction method after the image scanning piece is identified when the executable instructions stored in the memory are operated.

The invention provides a computer readable storage medium storing executable instructions which when executed by a processor implement the visual information reproduction method after the identification of the image scanning piece according to any one of the above.

With the visual information reproduction method, system and storage medium after image-type scanning piece recognition, multiple users can concurrently upload image-type scanned PDF files to the system for visual information reproduction. The system uses a visual information analysis algorithm based on computer vision to analyze the visual information of the scanned PDF content, such as fonts and typesetting layout, and outputs an editable Word document that carries the corresponding visual information.

Drawings

FIG. 1 is a schematic diagram of an embodiment of the present invention.

FIG. 2 is a flow chart of an embodiment of the present invention.

Fig. 3 is a schematic diagram of a text line area detection result according to an embodiment of the present invention.

Fig. 4 shows a PDF file before processing and the Word file after processing by the visual information recovery system of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Referring to fig. 1 and 2, the present invention provides a visual information reproduction system after image type scan piece identification, which is loaded on a local CPU server and is used concurrently by multiple users.

The processing steps comprise the following steps:

First, a comparison table of word width pixels, line spacing pixels and the font sizes of the font library in a Word document is established;

that is, the correspondence between all word width pixels and line spacing pixels and the common fonts, font sizes and line spacing sizes in Word is established.
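As an illustration of how such a comparison table might be built, the sketch below derives an expected character width in pixels from each font size. It is not taken from the patent: the function name `build_width_table`, the rendering resolution of 144 dpi, the listed font sizes, and the assumption that a full-width character is about as wide as the font size (one em) are all introduced here. The only fixed fact used is that one typographic point is 1/72 inch.

```python
from typing import Dict, Iterable

POINTS_PER_INCH = 72  # one typographic point is 1/72 inch


def build_width_table(point_sizes: Iterable[float], dpi: int) -> Dict[float, float]:
    """Map font size in points -> expected pixel width of one full-width character.

    Assumption (not from the patent): a full-width character is about one em
    wide, so its width equals the font size converted to pixels at `dpi`.
    """
    return {pt: pt * dpi / POINTS_PER_INCH for pt in point_sizes}


# Hypothetical font sizes and resolution, chosen only for this example.
width_table = build_width_table([9, 10.5, 12, 14, 16, 18], dpi=144)
for pt, px in width_table.items():
    print(f"{pt} pt -> {px} px")
```

A table for line spacing could be prepared in the same spirit, for instance from measurements of rendered sample pages.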

Second, the image-type scanned PDF is uploaded to the content visual information recovery system, and the system obtains the image-type scanned PDF;

the system executes a localized encryption program while acquiring the image-type scanned PDF.

Third, the image-type scanned PDF is cut page by page into pictures and the pictures are preprocessed;

the preprocessing reduces the interference factors that obstruct recognition of the acquired image-type scanned PDF, and includes seal removal, tilt correction, noise removal and the like.

Then, the position information of the text line areas of all cut pictures of the image-type scanned PDF is calculated through a text line detection algorithm based on deep learning and computer vision technology; that is, the word width $L_i$ and the start coordinate information and end coordinate information of all text lines are calculated, including the line width $W$ and the line height $H$;

the image type scanning piece PDF is a long text, and page-by-page analysis processing is carried out on the long text image type scanning piece PDF, so that text line area detection and line area positioning are realized, and the initial coordinate information and the end coordinate information of each text line area are analyzed and calculated;

the position information of a line is as follows: the start coordinate of the text line area is $P_0[w,h]$ and the end coordinate of the text line area is $P_1[w,h]$. $P_0$ and $P_1$ are at different positions: $P_0$ refers to the start position of each text line area and $P_1$ refers to its end position. The width of each line of characters is the width value of $P_1$ minus the width value of $P_0$, and the height of each line of characters is the height value of $P_0$ minus the height value of $P_1$, i.e. the line width is $W = W_{P_1} - W_{P_0}$ and the line height is $H = H_{P_0} - H_{P_1}$.
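The width and height rule above can be sketched in a few lines of Python. The class names `Point` and `TextLine` and the sample coordinates are hypothetical; only the two subtractions (end width minus start width, start height minus end height) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    w: float  # width (horizontal) value of the coordinate
    h: float  # height (vertical) value of the coordinate


@dataclass(frozen=True)
class TextLine:
    p0: Point  # start coordinate of the text line area
    p1: Point  # end coordinate of the text line area

    @property
    def width(self) -> float:
        # Line width: width value of P1 minus width value of P0.
        return self.p1.w - self.p0.w

    @property
    def height(self) -> float:
        # Line height: height value of P0 minus height value of P1.
        return self.p0.h - self.p1.h


# Made-up coordinates for one detected line.
line = TextLine(p0=Point(100, 800), p1=Point(580, 776))
print(line.width, line.height)
```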

Then recognizing the text content in all text line detection areas by an OCR text recognition technology based on deep learning;

OCR (Optical Character Recognition ) refers to the process of an electronic device (e.g., a scanner or digital camera) checking characters printed on paper, determining their shape by detecting dark and light patterns, and then translating the shape into computer text using a character recognition method; that is, the technology of converting the characters in the paper document into the image file of black-white lattice by optical mode and converting the characters in the image into the text format by the recognition software for further editing and processing by the word processing software is adopted.

Further, the number of characters, including punctuation marks, in each line detection area is counted and recorded as SUM;

the average width pixels of the characters are calculated from the obtained line width $W$ and the number of characters SUM in the line and are denoted $\bar{L}$, i.e. $\bar{L} = W / \mathrm{SUM}$. The average width pixels of the characters are compared with the calculated word width $L_i$, and the smaller value is taken as the final word width pixel value, i.e. the word width pixel value is finally determined as $L = \min(\bar{L}, L_i)$.
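A minimal sketch of this step is given below, assuming the line width and the detector's word width are already known. The function names and sample values are hypothetical, and ignoring whitespace when counting is an assumption made here; the text only states that punctuation is counted.

```python
def count_characters(text: str) -> int:
    """Count characters in a line, punctuation included.

    Assumption (not from the patent): whitespace is not counted.
    """
    return sum(1 for ch in text if not ch.isspace())


def final_word_width(line_width: float, text: str, detected_width: float) -> float:
    """Return the smaller of the average character width and the detected width."""
    total = count_characters(text)
    if total == 0:
        raise ValueError("cannot compute an average width for an empty line")
    average = line_width / total  # line width W divided by character count SUM
    return min(average, detected_width)


# Made-up recognized text and measurements for one line.
text = "影像型扫描件识别后，输出可编辑文档。"
width = final_word_width(line_width=432.0, text=text, detected_width=25.0)
print(count_characters(text), width)
```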

The obtained word width pixel value L is then looked up in the comparison table to obtain the corresponding font and font size, and the font and font size are matched with the position information of the line detection area;
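The lookup can be pictured as a nearest-match search over the comparison table. The sketch below is one possible reading, not the patent's stated procedure: the nearest-match rule, the function name `lookup_font`, and the table entries (font name, sizes and pixel widths) are all assumptions made for illustration.

```python
from typing import Dict, Tuple

FontKey = Tuple[str, float]  # (font name, font size in points)


def lookup_font(word_width_px: float, table: Dict[FontKey, float]) -> FontKey:
    """Return the table entry whose pixel width is closest to the measured width."""
    if not table:
        raise ValueError("comparison table is empty")
    return min(table, key=lambda key: abs(table[key] - word_width_px))


# Hypothetical comparison table: (font, size) -> expected word width in pixels.
comparison_table = {
    ("SimSun", 10.5): 21.0,
    ("SimSun", 12.0): 24.0,
    ("SimSun", 14.0): 28.0,
}
font, size = lookup_font(23.4, comparison_table)
print(font, size)
```

A measured width rarely matches a table entry exactly, which is why a closest-entry rule is used here instead of an exact key lookup.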

line spacing pixels are calculated from the start and end coordinate information of the text lines and looked up in the comparison table to obtain the corresponding line spacing size, and at the same time the paragraph visual information is calculated to determine whether a text line is a paragraph head;

specifically, the start coordinate information and end coordinate information of adjacent line areas are determined first; the line spacing pixels are determined by calculating the height difference between the end coordinate information of the high-order line and the start coordinate information of the low-order line, and the paragraph visual information is then determined by calculating the width difference between the start coordinate information of the high-order line and that of the low-order line.

For example, as shown in FIG. 3, the start and end coordinate information of adjacent lines of the current text is obtained: a start coordinate and an end coordinate for the upper row, and a start coordinate and an end coordinate for the lower row. As can be seen from FIG. 3, the start and end coordinates of a row are at different positions: the start coordinate refers to the starting position of the row and the end coordinate refers to its end position. The width in pixels of each row of characters is the width value of its end coordinate minus the width value of its start coordinate, and the height in pixels of each row of characters is the height value of its start coordinate minus the height value of its end coordinate;

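the line spacing pixels of adjacent rows are calculated as the height difference between the end coordinate of the upper row and the start coordinate of the lower row;

the calculation result is compared with the comparison table to determine the text line spacing;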

the paragraph visual information is obtained by calculating the width difference between the starting coordinates of the adjacent lines;

the calculated width difference of adjacent lines is a positive number, which may be an integer or a non-integer value; if the calculated width difference is approximately equal to 2 times the average width pixels of the characters, the line is a paragraph head, and two blank character spaces are left in front of the line.
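The line spacing and paragraph head rules can be sketched as follows. Several details are assumptions made here: the function names, the use of an absolute value for the spacing (the direction of the height axis is not stated), and the tolerance of a quarter of a character width used to interpret "approximately equal to 2 times".

```python
def line_spacing_px(upper_end_h: float, lower_start_h: float) -> float:
    """Height difference between the end of the upper line and the start of the lower line.

    Assumption: the absolute value is taken because the axis direction is not specified.
    """
    return abs(upper_end_h - lower_start_h)


def is_paragraph_head(start_w: float, neighbour_start_w: float,
                      word_width: float, tolerance: float = 0.25) -> bool:
    """True if the line starts about two character widths to the right of its neighbour.

    `tolerance` is a hypothetical fraction of one character width.
    """
    indent = start_w - neighbour_start_w
    return abs(indent - 2 * word_width) <= tolerance * word_width


# Made-up coordinates: characters are 24 px wide, the body text starts at w = 100.
print(line_spacing_px(upper_end_h=776, lower_start_h=764))
print(is_paragraph_head(start_w=148, neighbour_start_w=100, word_width=24))
print(is_paragraph_head(start_w=100, neighbour_start_w=100, word_width=24))
```

A line flagged this way would be written to the output document with a two-character indent.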

As shown in FIG. 4, the image-type scanned PDF is processed by the text line detection algorithm based on deep learning and computer vision technology, by the OCR text recognition technology based on deep learning, and by the series of calculations described above, and an editable Word document is output according to the font, font size, line spacing and paragraph visual information. The left side of FIG. 4 shows the image-type scanned PDF source file, and the right side of FIG. 4 shows the content of the Word document after visual information recovery.

In the invention, the comparison table of fonts and font sizes is looked up by the word width pixels in steps S4 to S8 to obtain the corresponding font and font size, the paragraph head is calculated in step S9, and the blank space at the paragraph head is identified as 2 times the average width pixels of the characters.

The invention provides a visual information reproduction system after image type scanning piece identification, which is loaded on a local CPU server and is used by multiple users in parallel, wherein the visual information reproduction system after image type scanning piece identification is an image type scanning piece identification content visual information recovery system and comprises the following components:

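a memory for storing executable instructions;

and a processor for implementing the visual information reproduction method after image-type scanning piece recognition when running the executable instructions stored in the memory.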

The invention also provides a computer readable storage medium storing executable instructions which when executed by a processor implement the visual information reproduction method after the image scanning piece is identified according to any one of the above.

The invention loads the image-type scanning piece recognition content visual information recovery system on a local CPU server and establishes a comparison table of word width pixels, line spacing pixels and the font sizes of the font library in a Word document. A user uploads the image-type scanned PDF whose visual information is to be reproduced; the system executes a security program during the upload and then preprocesses the image-type scanned PDF. Using a visual information analysis algorithm based on computer vision and an OCR text recognition technology based on deep learning, the system calculates the word width $L_i$, the line width $W$ and the line height $H$, recognizes the text content in all text line areas, counts the number of characters SUM in each line detection area, and then calculates the average width pixels of the characters. The word width pixels are then looked up in the comparison table to obtain the corresponding font and font size. The line spacing pixels are calculated from the start and end coordinates of adjacent lines of the current text and looked up in the comparison table to determine the line spacing. The paragraph visual information is then calculated: the width difference between the starting coordinates of adjacent lines is computed, and if the result is about 2 times the average width pixels of the characters, the line is marked as a paragraph head, with two blank character spaces in front of the line. Finally, an editable Word document is output according to the fonts, font sizes, line spacing and paragraph visual information, which achieves the purpose of the invention.

The image-type scanning piece recognition content visual information recovery system supports concurrent uploads by multiple users without mutual interference, and the uploaded files are kept confidential so that user data is not leaked.
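The patent does not say how concurrent use is implemented. As one possible arrangement, sketched under that caveat, each upload could be handled as an independent task in a worker pool so that one user's job does not touch another's data. The function `convert` is a hypothetical placeholder standing in for the whole conversion pipeline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def convert(upload: bytes) -> str:
    """Placeholder for the full conversion of one uploaded PDF (hypothetical)."""
    return f"converted {len(upload)} bytes"


def convert_all(uploads: List[bytes], max_workers: int = 4) -> List[str]:
    """Process uploads concurrently; each task only sees its own input."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the inputs.
        return list(pool.map(convert, uploads))


results = convert_all([b"a", b"bcd"])
print(results)
```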

The foregoing description is only illustrative of the present invention and is not intended to limit the scope of the invention, and all equivalent structures or equivalent processes or direct or indirect application in other related technical fields are included in the scope of the present invention.

Claims

1. A visual information reproduction method after image type scanning piece identification is characterized in that: the method comprises the following steps:

s2: uploading the PDF of the image scanning piece;

2. The visual information reproduction method after image-type scanning piece recognition according to claim 1, characterized in that step S1 includes: establishing the correspondence between all word width pixels and line spacing pixels and the common fonts, font sizes and line spacing sizes in Word.

3. The visual information reproduction method after image-type scanning piece recognition according to claim 1, characterized in that step S2 includes: executing a localized encryption program when the image-type scanned PDF is uploaded.

4. The visual information reproduction method after image-type scanning piece recognition according to claim 1, characterized in that the preprocessing in step S3 includes seal removal, tilt correction and noise removal.

5. The visual information reproduction method after image-type scanning piece recognition according to claim 1, characterized in that step S4 includes: the image-type scanned PDF is a long text; the pictures of the long-text image-type scanned PDF are analyzed page by page to detect and locate the text line areas, and the start coordinate information and end coordinate information of each line area are calculated by analysis.

6. The visual information reproduction method after image-type scanning piece recognition according to claim 1, characterized in that step S9 includes: according to the start coordinate information and end coordinate information of adjacent line areas, determining the line spacing pixels by calculating the line width difference between the start coordinate information of the high-order line and that of the low-order line, and determining the paragraph visual information by calculating the line width difference between the start coordinate information of the high-order line and that of the low-order line.

7. The visual information reproduction method after image-type scanning piece recognition according to claim 6, characterized in that step S9 determines the paragraph head as follows: if the calculated width difference between the starting coordinates of adjacent lines is 2 times the width pixels of the characters, the line is marked as a paragraph head, with two blank character spaces in front of the line.

8. The visual information reproduction method after image-type scanning piece recognition according to any one of claims 1 to 7, characterized in that the comparison table of fonts, font sizes and line spacing is looked up by the word width pixels in steps S4 to S8 to obtain the corresponding font size, the paragraph head is calculated in step S9, and the blank space of 2 characters at the paragraph head is identified.

9. A visual information reproduction system after image-type scanning piece recognition, loaded on a local CPU server and used concurrently by multiple users, characterized in that the system is an image-type scanning piece recognition content visual information recovery system comprising: a memory for storing executable instructions; and a processor for implementing, when running the executable instructions stored in the memory, the visual information reproduction method after image-type scanning piece recognition according to any one of claims 1 to 8.

10. A computer-readable storage medium storing executable instructions which, when executed by a processor, implement the visual information reproduction method after image-type scanning piece recognition according to any one of claims 1 to 8.