CN113836971B - Visual information reproduction method, system and storage medium after image type scanning piece identification


Info

Publication number
CN113836971B
CN113836971B CN202010580263.XA CN202010580263A CN113836971B CN 113836971 B CN113836971 B CN 113836971B CN 202010580263 A CN202010580263 A CN 202010580263A CN 113836971 B CN113836971 B CN 113836971B
Authority
CN
China
Prior art keywords
line
visual information
pixels
word
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010580263.XA
Other languages
Chinese (zh)
Other versions
CN113836971A (en)
Inventor
翟晓刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Life Insurance Asset Management Co ltd
Original Assignee
China Life Insurance Asset Management Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Landscapes

  • Character Input (AREA)

Abstract

The invention relates to the field of document processing and discloses a visual information reproduction method, a visual information reproduction system and a storage medium after image type scanning piece recognition. A user loads the image type scanning piece recognition content visual information recovery system and uploads the image type scanning piece PDF whose visual information is to be reproduced. A comparison table is established that relates word width pixels and line spacing pixels to the font sizes and line spacing sizes of the font library in a Word document. Using a visual information analysis algorithm based on computer vision and an OCR text recognition technology based on deep learning, the system recognizes the text content in all line detection areas, counts the number of characters in each line detection area, and calculates the average character width and the line spacing pixels. These values are compared with the comparison table to obtain the font size and line spacing; paragraph visual information is then calculated to determine the paragraph heads. Finally, an editable Word document is output according to the font size, the line spacing and the paragraph visual information. The information uploaded by the user is kept confidential, the system is safe and easy to operate, and the PDF format conversion is completed quickly.

Description

Visual information reproduction method, system and storage medium after image type scanning piece identification
Technical Field
The present invention relates to the field of document processing, and in particular, to a method, a system, and a storage medium for reproducing visual information after image-type scan piece recognition.
Background
PDF (Portable Document Format) is a common electronic file format, has higher universality and compatibility in multi-type operating systems, and can ensure that data information is not modified or changed due to the coding type in the file transmission process, so that PDF is taken as a main stream form of file information transmission. PDF files can prevent others from inadvertently touching the keyboard to modify the file contents, but at the same time also create inconvenient results for modification, and are difficult to convert to other file formats.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a visual information reproduction method, a visual information reproduction system and a storage medium after image scanning piece identification.
In order to solve the technical problems, the invention provides the following technical scheme:
A visual information reproduction method after image type scanning piece identification comprises the following steps:
S1: establishing a comparison table of word width pixels, line spacing pixels and the fonts, font sizes and line spacing sizes of the font library in Word documents;
S2: uploading the image type scanning piece PDF;
S3: cutting the image type scanning piece PDF page by page into picture format and preprocessing the pictures;
S4: calculating the position information of the text line areas of all cut pictures of the image type scanning piece PDF through a text line detection algorithm based on deep learning and computer vision technology, namely calculating the word width pixels of each word in the line areas and the start coordinate information and end coordinate information of all text line areas;
S5: recognizing the text content in all line detection areas by an OCR text recognition technology based on deep learning;
S6: counting the number of characters, including punctuation marks, in the line detection area;
S7: calculating the average width pixels of the characters from the line width, obtained from the start and end coordinate information of the text line area, and the number of characters in the line; comparing the average width pixels with the word width pixels calculated in step S4; and taking the smaller value as the final word width pixel value;
S8: substituting the word width pixel value obtained in step S7 into the comparison table to obtain the corresponding font and font size, and matching the font and font size with the position information of the line detection area;
S9: calculating the line spacing pixels according to the start coordinate information and end coordinate information of the text line areas, substituting the line spacing pixels into the comparison table to obtain the corresponding line spacing size, and simultaneously calculating the paragraph visual information to determine whether the line is a paragraph head;
S10: outputting an editable Word document according to the fonts, font sizes, line spacing and paragraph visual information.
Further, the step S1 includes: establishing the correspondence between all word width pixels and line spacing pixels and the common fonts, font sizes and line spacing sizes in Word.
Further, the step S2 includes: executing a localized encryption program when uploading the image type scanning piece PDF.
Further, the preprocessing of step S3 includes: seal removal, tilt correction, noise removal and the like.
Further, the step S4 includes: the image type scanning piece PDF is a long text; the long-text image type scanning piece pictures are analyzed page by page to detect and locate the text line areas, and the start coordinate information and end coordinate information of each line area are calculated by analysis.
Further, the step S9 includes: according to the start coordinate information and end coordinate information of adjacent line areas, determining the line spacing pixels by calculating the height difference between the end coordinate information of the upper line and the start coordinate information of the lower line, and determining the paragraph visual information by calculating the width difference between the start coordinate information of the upper line and that of the lower line.
Further, the step S9 determines the paragraph head as follows: if the width difference between the start coordinates of adjacent lines is about 2 times the width pixels of the characters calculated in step S7, the line is marked as a paragraph head, and two blank character cells are left in front of the line.
The invention provides a visual information reproduction method after image scanning piece recognition, in which the comparison table of fonts, font sizes and line spacing is searched using the word width pixels in steps S4 to S8 to obtain the corresponding font size, the paragraph head is calculated in step S9, and the two-character blank space at the paragraph head is analyzed.
The invention provides a visual information reproduction system after image type scanning piece identification, which is loaded on a local CPU server and used by multiple users in parallel, wherein the visual information reproduction system after image type scanning piece identification is an image type scanning piece identification content visual information recovery system and comprises the following components:
a memory for storing executable instructions;
and the processor is used for realizing the visual information reproduction method after the image scanning piece is identified when the executable instructions stored in the memory are operated.
The invention provides a computer readable storage medium storing executable instructions which when executed by a processor implement the visual information reproduction method after the identification of the image scanning piece according to any one of the above.
The visual information reproduction method, system and storage medium after image type scanning piece identification allow multiple users to use the system concurrently, each uploading an image type scanning piece PDF file for visual information reproduction. The system uses a visual information analysis algorithm based on computer vision to analyze the visual information, such as fonts and typesetting patterns, corresponding to the content of the image type scanning piece PDF, and outputs an editable Word document with the corresponding visual information.
Drawings
FIG. 1 is a schematic diagram of an embodiment of the present invention.
FIG. 2 is a flow chart of an embodiment of the present invention.
Fig. 3 is a schematic diagram of a text line area detection result according to an embodiment of the present invention.
Fig. 4 is a PDF file and a word file before and after processing by the visual information restoring system of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1 and 2, the present invention provides a visual information reproduction system after image type scan piece identification, which is loaded on a local CPU server and is used concurrently by multiple users.
The processing steps comprise the following steps:
First, a comparison table of word width pixels, line spacing pixels and the font sizes of the font library in a Word document is established;
that is, the correspondence between all word width pixels and line spacing pixels and the common fonts, font sizes and line spacing sizes in Word is established.
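A minimal sketch of how such a comparison table might be represented and queried, assuming a nearest-match lookup. The pixel widths, the font name and the `lookup_font` helper are hypothetical placeholders rather than values from the patent; a real table would be measured from rendered Word documents at the scan resolution.

```python
from typing import NamedTuple


class FontEntry(NamedTuple):
    width_px: float  # character width in pixels at the scan resolution
    font: str        # font name in the Word font library
    size_pt: float   # font size in points


# Hypothetical entries for illustration only; real values must be measured.
FONT_TABLE = [
    FontEntry(28.0, "SimSun", 10.5),
    FontEntry(32.0, "SimSun", 12.0),
    FontEntry(37.0, "SimSun", 14.0),
    FontEntry(43.0, "SimSun", 16.0),
]


def lookup_font(char_width_px: float, table=FONT_TABLE) -> FontEntry:
    """Return the table entry whose width is closest to the measured width."""
    return min(table, key=lambda entry: abs(entry.width_px - char_width_px))
```

A nearest-match rule is one possible design choice here: measured widths rarely hit a table value exactly, so the closest entry is taken instead of requiring equality.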
Secondly, the image type scanning piece PDF is uploaded to the content visual information recovery system, which thereby obtains the image type scanning piece PDF;
the system executes a localized encryption program while acquiring the image type scanning piece PDF.
Thirdly, the image type scanning piece PDF is cut page by page into picture format and the pictures are preprocessed;
the preprocessing reduces interference factors in the acquired image type scanning piece PDF and includes seal removal, tilt correction, noise removal and the like.
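The text does not specify how seal removal is carried out. As an illustration only, the sketch below assumes the seal is a red stamp and whitens strongly red pixels in an RGB image held as nested lists; the function name `remove_red_seal` and both thresholds are assumptions, not part of the described system.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each channel in 0-255


def remove_red_seal(image: List[List[Pixel]],
                    red_min: int = 150,
                    margin: int = 60) -> List[List[Pixel]]:
    """Return a copy of the image with strongly red pixels replaced by white.

    A pixel counts as "seal red" when its red channel is at least red_min
    and exceeds both green and blue by at least margin (assumed thresholds).
    """
    white = (255, 255, 255)
    cleaned = []
    for row in image:
        new_row = []
        for r, g, b in row:
            is_red = r >= red_min and r - g >= margin and r - b >= margin
            new_row.append(white if is_red else (r, g, b))
        cleaned.append(new_row)
    return cleaned
```

Black text and light paper pass through unchanged because their channels are close to one another; only pixels dominated by red are whitened.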
Then, the position information of the text line areas of all cut pictures of the image type scanning piece PDF is calculated through a text line detection algorithm based on deep learning and computer vision technology; that is, the word width $L_{(i)}$ and the start coordinate information and end coordinate information of all text lines, including the line width W and the line height H, are calculated;
the image type scanning piece PDF is a long text, and the long-text image type scanning piece PDF is analyzed page by page to detect and locate the text line areas and to calculate the start and end coordinate information of each text line area;
for a given line, the start coordinate of the text line area is denoted $P_0[w,h]$ and the end coordinate is denoted $P_1[w,h]$. $P_0$ and $P_1$ are different positions: $P_0$ is the start position of each text line area and $P_1$ is its end position. The width of each line of text is the width value of $P_1$ minus the width value of $P_0$, and the height of each line of text is the height value of $P_0$ minus the height value of $P_1$; that is, the line width is $W = W_{P_1} - W_{P_0}$ and the line height is $H = H_{P_0} - H_{P_1}$.
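The width and height relations above can be sketched as a small data type. The class name `LineBox` and its field names are illustrative assumptions; the sign convention follows the formulas in the text, which imply that the start point has the larger h value.

```python
from typing import NamedTuple


class LineBox(NamedTuple):
    """Text line region given by start point P0 and end point P1 as (w, h)."""
    w0: float  # width coordinate of P0
    h0: float  # height coordinate of P0
    w1: float  # width coordinate of P1
    h1: float  # height coordinate of P1

    @property
    def width(self) -> float:
        # Line width: W = W_P1 - W_P0
        return self.w1 - self.w0

    @property
    def height(self) -> float:
        # Line height: H = H_P0 - H_P1
        return self.h0 - self.h1
```

For example, a line starting at (100, 900) and ending at (1100, 860) has a width of 1000 pixels and a height of 40 pixels.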
Then recognizing the text content in all text line detection areas by an OCR text recognition technology based on deep learning;
OCR (Optical Character Recognition) refers to the process in which an electronic device (e.g., a scanner or digital camera) examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then translates the shapes into computer text using a character recognition method. In other words, the characters in a paper document are converted optically into a black-and-white dot-matrix image file, and recognition software converts the characters in the image into a text format that word processing software can further edit and process.
Further, the number of characters, including punctuation marks, in the line detection area is counted and recorded as SUM;
the average width pixels of the characters are calculated from the obtained line width W and the number of characters SUM in the line and denoted $\bar{L}$, i.e. $\bar{L} = W / \mathrm{SUM}$. The average width pixels are then compared with the calculated word width $L_{(i)}$, and the smaller value is taken as the final word width pixel value, namely: $L = \min(\bar{L}, L_{(i)})$
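The rule just described, average width equals line width divided by character count and the smaller of the two width estimates wins, can be sketched as follows. The function and parameter names are illustrative assumptions.

```python
def final_char_width(line_width_px: float, char_count: int,
                     detected_width_px: float) -> float:
    """Return the smaller of the average and the detected character width.

    line_width_px     -- line width W from the line's start and end coordinates
    char_count        -- number of characters SUM recognized in the line
    detected_width_px -- per-character width reported by the line detector
    """
    if char_count <= 0:
        raise ValueError("char_count must be positive")
    average_width_px = line_width_px / char_count
    return min(average_width_px, detected_width_px)
```

With a line width of 1000 pixels, 32 characters and a detected width of 33.0 pixels, the average is 31.25 pixels, so 31.25 is taken as the final value.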
The obtained word width pixel value L is then substituted into the comparison table to obtain the corresponding font and font size, which are matched with the position information of the line detection area;
the line spacing pixels are calculated according to the start coordinate information and end coordinate information of the text lines and substituted into the comparison table to obtain the corresponding line spacing size, and the paragraph visual information is calculated at the same time to determine whether a text line is a paragraph head;
the start coordinate information and end coordinate information of adjacent line areas are determined first; the line spacing pixels are determined by calculating the height difference between the end coordinate information of the upper line and the start coordinate information of the lower line, and the paragraph visual information is then determined by calculating the width difference between the start coordinate information of the upper line and that of the lower line.
For example, as shown in FIG. 3, the start coordinate information and end coordinate information of two adjacent lines of the current text are obtained, where $P_0^{(1)}[w,h]$ and $P_1^{(1)}[w,h]$ are the start and end coordinates of the upper row and $P_0^{(2)}[w,h]$ and $P_1^{(2)}[w,h]$ are the start and end coordinates of the lower row. As can be seen from FIG. 3, $P_0$ and $P_1$ are different positions: $P_0$ is the start coordinate position of a row and $P_1$ is its end coordinate position. The width in pixels of each row of characters is the width value of $P_1$ minus the width value of $P_0$, and the height in pixels of each row of characters is the height value of $P_0$ minus the height value of $P_1$;
the row spacing pixels of adjacent rows are calculated as the height difference between the end coordinate of the upper row and the start coordinate of the lower row:
$$D = H_{P_1^{(1)}} - H_{P_0^{(2)}}$$
the calculation result is matched against the comparison table to determine the text line spacing;
the width difference between the start coordinates of the adjacent rows is calculated by the following formula to obtain the paragraph visual information:
$$\Delta W = \left| W_{P_0^{(1)}} - W_{P_0^{(2)}} \right|$$
from the above formula, the calculated width difference of adjacent rows is a positive number, which may be an integer or a non-integer value; if the calculated width difference is approximately equal to 2 times the average width pixels of the characters, the row is a paragraph head, and two blank character cells are left in front of the row.
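A minimal sketch of the line spacing and paragraph head checks described here. The function names are illustrative, and the `tolerance` used to interpret "approximately 2 times" is an assumption, since the text gives no numeric bound. The sketch measures the indent of a line relative to its neighbour, so only the indented line is marked.

```python
def line_spacing_px(upper_end_h: float, lower_start_h: float) -> float:
    """Vertical gap between the end of the upper line and the start of the lower line."""
    return abs(upper_end_h - lower_start_h)


def is_paragraph_head(start_w: float, neighbour_start_w: float,
                      char_width_px: float, tolerance: float = 0.25) -> bool:
    """Return True when a line is indented by roughly two character widths.

    start_w           -- width coordinate of this line's start point
    neighbour_start_w -- width coordinate of the adjacent line's start point
    tolerance         -- allowed deviation, as a fraction of one character width
                         (assumed value)
    """
    indent = start_w - neighbour_start_w
    return abs(indent - 2 * char_width_px) <= tolerance * char_width_px
```

For instance, with 32-pixel characters, a line starting at w = 164 next to a line starting at w = 100 is indented by 64 pixels and is treated as a paragraph head, while two lines that both start at w = 100 are not.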
As shown in FIG. 4, the image type scanning piece PDF is processed by the text line detection algorithm based on deep learning and computer vision technology, the OCR text recognition technology based on deep learning, and a series of calculations, and an editable Word document is output according to the font, font size, line spacing and paragraph visual information. The left side of FIG. 4 shows the source image type scanning piece PDF, and the right side shows the content of the Word document after visual information recovery.
In the invention, the corresponding fonts and font sizes are obtained by looking up the comparison table of fonts and font sizes using the word width pixels in steps S4 to S8, the paragraph head is calculated in step S9, and the blank space at the paragraph head is analyzed as 2 times the average width pixels of the characters.
The invention provides a visual information reproduction system after image type scanning piece identification, which is loaded on a local CPU server and is used by multiple users in parallel, wherein the visual information reproduction system after image type scanning piece identification is an image type scanning piece identification content visual information recovery system and comprises the following components:
a memory for storing executable instructions;
and the processor is used for realizing the visual information reproduction method after the image scanning piece is identified when the executable instructions stored in the memory are operated.
The invention also provides a computer readable storage medium storing executable instructions which when executed by a processor implement the visual information reproduction method after the image scanning piece is identified according to any one of the above.
In the invention, the image type scanning piece identification content visual information recovery system is loaded on a local CPU server, and a comparison table of word width pixels, line spacing pixels and the font sizes of the font library in a Word document is established. A user uploads the image type scanning piece PDF whose visual information is to be reproduced, and the system executes a security program during the upload. The image type scanning piece PDF is then preprocessed, after which the system uses a visual information analysis algorithm based on computer vision and an OCR character recognition technology based on deep learning to calculate the word width $L_{(i)}$, the line width W and the line height H, and to recognize the text content in all text line areas. The number of characters SUM in each line detection area is then counted, and the average width pixels of the characters are calculated. The word width pixels, taken as the smaller of the average width pixels and $L_{(i)}$, are matched against the comparison table to obtain the corresponding font and font size. The line spacing pixels are calculated from the start and end coordinates of adjacent lines of the current text and matched against the comparison table to determine the line spacing. The paragraph visual information is then calculated: the width difference between the start coordinates of adjacent lines is computed, and if the result is about 2 times the average width pixels of the characters, the line is marked as a paragraph head with two blank character cells in front of it. Finally, the editable Word document is output according to the fonts, font sizes, line spacing and paragraph visual information, thereby achieving the purpose of the invention.
The image type scanning piece identification content visual information recovery system supports concurrent uploads by multiple users without the uploads affecting one another, and the uploaded files are kept confidential so that user data is not leaked.
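One way such independent, concurrent handling of uploads could be organized is sketched below with a thread pool. This is an assumption for illustration: the text does not describe the concurrency mechanism, and `convert_scan` is a hypothetical stand-in for the per-upload pipeline (steps S3 to S10).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict


def convert_scan(user_id: str, pdf_bytes: bytes) -> dict:
    """Hypothetical stand-in for the per-upload conversion pipeline."""
    # Each call works only on its own arguments; no state is shared between users.
    return {"user": user_id, "input_size": len(pdf_bytes)}


def process_uploads(uploads: Dict[str, bytes], max_workers: int = 4) -> Dict[str, dict]:
    """Run one independent conversion job per user upload and collect the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {user: pool.submit(convert_scan, user, data)
                   for user, data in uploads.items()}
        return {user: future.result() for user, future in futures.items()}
```

Because each job receives only its own user's bytes and returns its own result, one user's upload cannot read or alter another's in this sketch.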
The foregoing description is only illustrative of the present invention and is not intended to limit the scope of the invention, and all equivalent structures or equivalent processes or direct or indirect application in other related technical fields are included in the scope of the present invention.

Claims (10)

1. A visual information reproduction method after image type scanning piece identification is characterized in that: the method comprises the following steps:
S1: establishing a comparison table of word width pixels, line spacing pixels and the fonts, font sizes and line spacing sizes of the font library in Word documents;
S2: uploading the image type scanning piece PDF;
S3: cutting the image type scanning piece PDF page by page into picture format and preprocessing the pictures;
S4: calculating the position information of the text line areas of all cut pictures of the image type scanning piece PDF through a text line detection algorithm based on deep learning and computer vision technology, namely calculating the word width pixels of each word in the line areas and the start coordinate information and end coordinate information of all text line areas;
S5: recognizing the text content in all line detection areas by an OCR text recognition technology based on deep learning;
S6: counting the number of characters, including punctuation marks, in the line detection area;
S7: calculating the average width pixels of the characters from the line width, obtained from the start and end coordinate information of the text line area, and the number of characters in the line; comparing the average width pixels with the word width pixels calculated in step S4; and taking the smaller value as the final word width pixel value;
S8: substituting the word width pixel value obtained in step S7 into the comparison table to obtain the corresponding font and font size, and matching the font and font size with the position information of the line detection area;
S9: calculating the line spacing pixels according to the start coordinate information and end coordinate information of the text line areas, substituting the line spacing pixels into the comparison table to obtain the corresponding line spacing size, and simultaneously calculating the paragraph visual information to determine whether the line is a paragraph head;
S10: outputting an editable Word document according to the fonts, font sizes, line spacing and paragraph visual information.
2. The method for reproducing visual information after recognition of an image-type scanner according to claim 1, wherein: the step S1 includes: and establishing the corresponding relation between all the word width pixels and the line spacing pixels and the common fonts, word sizes and line spacing sizes in word.
3. The method for reproducing visual information after recognition of an image-type scanner according to claim 1, wherein: the step S2 includes: and executing a localization encryption program when uploading the image type scanning piece PDF.
4. The method for reproducing visual information after recognition of an image-type scanner according to claim 1, wherein: the step S3 of preprocessing comprises the following steps: stamp removal, tilt correction, noise removal are used.
5. The method for reproducing visual information after recognition of an image-type scanner according to claim 1, wherein: the step S4 includes: the image type scanning piece PDF is a long text, page-by-page analysis processing is carried out on the long text image type scanning piece picture, text line region detection and line region positioning are achieved, and the starting coordinate information and the ending coordinate information of each line region are calculated through analysis.
6. The method for reproducing visual information after recognition of an image-type scanner according to claim 1, wherein: the step S9 includes: and determining line space pixels by calculating the line width difference between the starting coordinate information of the high-order line and the starting coordinate information of the low-order line according to the starting coordinate information and the ending coordinate information of the adjacent line areas, and determining paragraph visual information by calculating the line width difference between the starting coordinate information of the high-order line and the starting coordinate information of the low-order line.
7. The method for reproducing information from an image-type scanner as defined in claim 6, wherein: the step S9 determines the paragraph head as follows: if the calculated line width difference between the start coordinates of adjacent lines is 2 times the width pixels of the characters, the line is marked as a paragraph head, and two blank character cells are left in front of the line.
8. A visual information reproducing method after recognition of an image type scanner according to any one of claims 1 to 7, wherein: and searching a comparison table of fonts, word sizes and line spacing through the word width pixels in the steps S4 to S8 to obtain corresponding word sizes, calculating segment heads through the step S9, and analyzing 2 characters in the segment head space.
9. The visual information reproduction system after the image scanning piece is identified is loaded on a local CPU server, and the system is used by multiple users in parallel, and is characterized in that: the visual information reproduction system after the image type scanning piece identification is an image type scanning piece identification content visual information recovery system, comprising: a memory for storing executable instructions; a processor for implementing the visual information reproduction method after recognition of an image-type scanner according to any one of claims 1 to 8 when executing the executable instructions stored in the memory.
10. A computer-readable storage medium storing executable instructions that when executed by a processor implement the method for reproducing visual information after recognition of an image-type scanner according to any one of claims 1 to 8.
CN202010580263.XA 2020-06-23 2020-06-23 Visual information reproduction method, system and storage medium after image type scanning piece identification Active CN113836971B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010580263.XA CN113836971B (en) 2020-06-23 2020-06-23 Visual information reproduction method, system and storage medium after image type scanning piece identification

Publications (2)

Publication Number Publication Date
CN113836971A CN113836971A (en) 2021-12-24
CN113836971B true CN113836971B (en) 2023-12-29

Family

ID=78964084

Country Status (1)

Country Link
CN (1) CN113836971B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant