CN110543810A - Technology for completely identifying header and footer of PDF (Portable document Format) file - Google Patents
Technology for completely identifying header and footer of PDF (Portable Document Format) file
- Publication number
- CN110543810A
- Authority
- CN
- China
- Prior art keywords
- header
- footer
- page
- value
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/413—Classification of content, e.g. text, photographs or tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/416—Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors
Landscapes
- Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Document Processing Apparatus (AREA)
Abstract
A method for identifying the headers and footers of a PDF file comprises the following steps. The PDF is parsed to obtain its original stored data, which is split page by page. Headers and footers are identified from the order in which data is stored within each PDF page: when a page has no header or footer, its data is stored item by item from top to bottom and from left to right; when it has a header and footer, the PDF stores the header first, then the footer, and then the body data. The header and footer are obtained from this data order and from the position of the bottommost line of data on the page, with a check based on the distance from the text data to the bottom of the page. For PDF files in pure-image format, headers and footers are identified from features: header and footer features are searched at the top and bottom of each page, analyzed across multiple pages, and the various header and footer forms are classified.
Description
Technical field:
The invention relates to a method for separating header and footer data from PDF (Portable Document Format) files.
Background art:
1. At present, almost all academic papers and listed-company announcements are distributed in PDF format through channels such as academic networks and stock exchanges (for example, the Shenzhen Stock Exchange). The format is convenient for reading across devices, but extracting data from such documents is complex, like looking for a needle in a haystack, because the files contain no structured data;
2. Structured extraction from a PDF file requires cutting out the header and footer areas so that they do not pollute the main body of the text;
3. For a PDF file in pure-image format, image recognition (OCR) must be performed on the page content to obtain all ruled lines, text coordinate data, and text content;
4. For a PDF file in a normal format, open-source software such as pdf.js can be used.
Summary of the invention:
The application provides a method and a device for identifying headers and footers in a PDF document, with two main processing modes:
1. Processing of PDF files in a normal format
(1) obtain the original parsed data of the PDF file;
(2) judge whether data belongs to a header or footer according to the order of the parsed data and its distance from the bottom of the page;
(3) such data in the upper half of the page is treated as the header;
(4) such data in the lower half of the page is treated as the footer;
(5) if the area between the header and the footer is too small, the header and footer are re-acquired using the feature-based algorithm below.
2. Header and footer identification for pure-image or abnormally formatted files
(1) search for header and footer features at the top and bottom of each page;
(2) analyze the features across multiple pages;
(3) classify the various header and footer forms.
Description of the drawings:
To illustrate the technical solutions of the embodiments of the invention more clearly, the drawings used in the embodiments are briefly described below. The drawings illustrate only some embodiments of the invention and should not be considered as limiting its scope; those skilled in the art can derive other related drawings from them without inventive effort.
FIG. 1 is a flowchart illustrating a complete header and footer identification technique for a PDF file according to the present invention.
Detailed description of the embodiments:
First, PDF files in a normal (not pure-image, i.e. not scanned) format
1. The open-source software pdf.js is used to obtain the original data produced by parsing the PDF file, and the coordinate data of each text item is extracted in the order of the parsed data.
The coordinate data includes:
x: distance from the left edge of the page
y: distance from the bottom of the page
w: width of the text data
h: height of the text data
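The coordinate record described above could be represented as a small data structure. The following is a minimal sketch only: the name `TextBox`, the helper properties, and the sample values are hypothetical, and it assumes that y refers to the lower edge of the text box.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextBox:
    """Hypothetical record for one extracted text item (not a pdf.js type)."""
    text: str
    x: float  # distance from the left edge of the page
    y: float  # distance from the bottom of the page (assumed: lower edge of the box)
    w: float  # width of the text
    h: float  # height of the text

    @property
    def top(self) -> float:
        """Upper edge of the box, measured from the bottom of the page."""
        return self.y + self.h

    @property
    def right(self) -> float:
        """Right edge of the box, measured from the left edge of the page."""
        return self.x + self.w


# Items are kept in parsed order, because later steps rely on that order.
page_items = [
    TextBox("Annual Report 2018", 72.0, 800.0, 120.0, 10.0),
    TextBox("1", 290.0, 30.0, 6.0, 10.0),
    TextBox("Body text", 72.0, 700.0, 450.0, 12.0),
]
```

Keeping the list in parsed order, rather than sorting it by position, preserves the storage-order signal that the method depends on.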
2. Judge whether a header and footer exist:
(1) assign an id to each extracted text coordinate record in parsed order;
(2) sort by y value to obtain the minimum y value, y_min;
(3) compare last_y, the y value of the text item with the largest id (the last item of the page in the original order), with y_min:
difference = last_y - y_min
(4) if the difference is below a threshold, the two are on the same line, and the page has no header or footer;
(5) if the difference is above the threshold, the two are not on the same line, and the page has a header and footer;
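The existence test in step 2 can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the function name `has_header_footer` and the default threshold of 2.0 are assumptions, and the source does not say how a difference exactly equal to the threshold is treated.

```python
def has_header_footer(ys, threshold=2.0):
    """Decide whether a page has a header/footer from parsed-order y values.

    ys: y values (distance from the page bottom) of the text items, listed
        in parsed order, so the list index plays the role of the id.
    threshold: assumed same-line tolerance in page units.
    """
    if not ys:
        return False
    y_min = min(ys)    # lowest text on the page
    last_y = ys[-1]    # text with the largest id (last in parsed order)
    difference = last_y - y_min
    # Same line (small difference): the last stored item is also the lowest
    # one, so the page was stored top to bottom with no header/footer.
    return difference > threshold


# Stored top to bottom: the last item is the lowest one.
plain_page = [700.0, 650.0, 600.0, 30.0]
# Header (800) and footer (30) stored first, body stored afterwards.
decorated_page = [800.0, 30.0, 700.0, 650.0, 60.0]
```

With these sample pages, `has_header_footer(plain_page)` is false because the difference is 0, and `has_header_footer(decorated_page)` is true because the last item sits 30 units above the lowest one.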
3. If a header and footer exist, find the header and footer areas (y values):
(1) find the other text items on the same line as the text item with the minimum y value;
(2) find the maximum id within that line, row_id_max; every text item on the page whose id is less than or equal to row_id_max belongs to the header or footer;
(3) of these, items in the upper half of the page belong to the header, and items in the lower half belong to the footer;
(4) find the minimum y value of the header and the maximum y value of the footer;
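Step 3 can be sketched as below. The names `Block` and `header_footer_bounds`, the `line_tol` same-line tolerance, and the sample page height of 842 are assumptions made for illustration; the source does not specify how "the same line" is measured.

```python
from typing import NamedTuple


class Block(NamedTuple):
    id: int    # position in parsed order
    y: float   # distance from the bottom of the page


def header_footer_bounds(blocks, page_height, line_tol=2.0):
    """Return (header_y_min, footer_y_max); either may be None if absent."""
    y_min = min(b.y for b in blocks)
    # (1) texts on the same line as the lowest text
    bottom_line = [b for b in blocks if abs(b.y - y_min) <= line_tol]
    # (2) largest id on that line; everything stored up to it is header/footer
    row_id_max = max(b.id for b in bottom_line)
    candidates = [b for b in blocks if b.id <= row_id_max]
    # (3) upper half -> header, lower half -> footer
    half = page_height / 2
    headers = [b for b in candidates if b.y >= half]
    footers = [b for b in candidates if b.y < half]
    # (4) lower edge of the header area, upper edge of the footer area
    header_y_min = min((b.y for b in headers), default=None)
    footer_y_max = max((b.y for b in footers), default=None)
    return header_y_min, footer_y_max


blocks = [Block(0, 800.0), Block(1, 30.0), Block(2, 700.0),
          Block(3, 650.0), Block(4, 60.0)]
bounds = header_footer_bounds(blocks, page_height=842.0)
```

For the sample page, the lowest line contains only the item with id 1, so the items with ids 0 and 1 are the header and footer candidates and `bounds` is `(800.0, 30.0)`.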
4. Correction:
The area between the header and the footer must be more than half of the page.
If the minimum y value of the header minus the maximum y value of the footer is less than half of the page height, the header and footer data just acquired is discarded, and the header and footer are then acquired using the feature-based identification described below.
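The correction rule amounts to a single comparison. The sketch below assumes that "half" means half of the page height and that both a header and a footer were found; the function name `bounds_are_plausible` is hypothetical.

```python
def bounds_are_plausible(header_y_min, footer_y_max, page_height):
    """True when the body area between footer and header exceeds half the page.

    header_y_min: lowest y value among header texts
    footer_y_max: highest y value among footer texts
    A False result means the order-based result should be discarded and the
    feature-based identification used instead.
    """
    body_height = header_y_min - footer_y_max
    return body_height > page_height / 2


# Body spans 770 of 842 units: keep the result.
keep = bounds_are_plausible(800.0, 30.0, 842.0)
# Body spans only 100 units: discard and fall back to feature identification.
discard = not bounds_are_plausible(400.0, 300.0, 842.0)
```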
Second, PDF files in pure-image (scanned) format, or files whose header and footer cannot be obtained by the first method
1. If the page is a pure image, obtain the coordinates of lines and of text blocks through image recognition.
2. Find the features of the header and footer:
(1) If the proportion of pages on which one of the following features occurs in the upper third or the lower third of the page reaches a threshold, the region is determined to be a header or footer.
(2) Features:
a. If there is a horizontal line whose position is relatively fixed across pages, it is taken as the boundary of the header or footer: in the upper half of the page, the part above the line is the header; in the lower half, the part below the line is the footer.
b. If there is no horizontal line, examine the first three lines of text in the upper half and the last five lines of text in the lower half of each page:
i. if a text block with similar characteristics (centered, left-aligned, right-aligned, and so on) appears at the same position on every page (in the upper third or the lower third) and occupies a similar width and height, it is a header or footer;
ii. perform digit-feature recognition on the string content of the text blocks; if the content matches the pattern of consecutive numbers, the block is a header or footer containing a page number.
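Features b.i and b.ii above can be illustrated with two small checks. This is a sketch under stated assumptions: the helper names, the tolerance `tol`, the ratio `min_ratio`, and the choice of the first integer in each string as the page number are all hypothetical and not given in the source.

```python
import re


def same_place(a, b, tol=3.0):
    """Feature i: two (x, y, w, h) boxes roughly match in position and size."""
    return all(abs(p - q) <= tol for p, q in zip(a, b))


def first_numbers(strings):
    """Extract the first integer from each string, or None when there is none."""
    found = []
    for s in strings:
        m = re.search(r"\d+", s)
        found.append(int(m.group()) if m else None)
    return found


def looks_like_page_numbers(strings, min_ratio=0.8):
    """Feature ii: the numbers on successive pages increase by one.

    strings: candidate text from the same region on consecutive pages.
    min_ratio: assumed share of page-to-page steps that must equal +1.
    """
    nums = first_numbers(strings)
    pairs = list(zip(nums, nums[1:]))
    if not pairs:
        return False
    hits = sum(1 for a, b in pairs
               if a is not None and b is not None and b - a == 1)
    return hits / len(pairs) >= min_ratio


footers = ["- 1 -", "- 2 -", "- 3 -", "- 4 -"]
titles = ["2018 Annual Report", "2018 Annual Report", "2018 Annual Report"]
```

Here `looks_like_page_numbers(footers)` is true, while the repeated title in `titles` fails the consecutive-number test and would have to qualify through the position check instead.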
(3) Correction
Different header and footer styles may exist within one PDF, and they can be classified.
The classification algorithm is as follows:
a. using a threshold, obtain the relevant data of the preliminarily identified header and footer text blocks, including: count, position, height, width, and horizontal-line position;
b. classify each data item according to a difference tolerance, so that each class of each data item has a corresponding id;
c. concatenate the ids into digit strings and de-duplicate them to obtain the header and footer types and the number of occurrences of each type;
d. confirm the header and footer according to the number of occurrences and their proportion.
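Steps a to d can be sketched as a bucketing and counting pass. The field names, the bucket width `step`, and the ratio `min_ratio` are assumptions for illustration. Bucketing by rounding is one simple way to realize "classifying according to a certain difference value"; two values that straddle a bucket boundary would still receive different ids.

```python
from collections import Counter

# Measured fields per candidate; "count" is used as-is, the rest are bucketed.
FIELDS = ("y", "h", "w", "line_y")


def signature(candidate, step=5.0):
    """Steps b and c: turn each field into a class id and join the ids."""
    parts = [str(candidate["count"])]
    parts += [str(round(candidate[f] / step)) for f in FIELDS]
    return "-".join(parts)


def confirm_patterns(candidates, min_ratio=0.3):
    """Step d: keep the signatures whose share of candidates is high enough."""
    counts = Counter(signature(c) for c in candidates)
    total = len(candidates)
    return {sig: n for sig, n in counts.items() if n / total >= min_ratio}


candidates = [
    {"count": 1, "y": 800.0, "h": 10.0, "w": 120.0, "line_y": 790.0},  # style A
    {"count": 1, "y": 801.0, "h": 10.5, "w": 121.0, "line_y": 791.0},  # style A
    {"count": 1, "y": 800.0, "h": 10.0, "w": 200.0, "line_y": 790.0},  # style B
    {"count": 1, "y": 800.0, "h": 10.0, "w": 201.0, "line_y": 790.0},  # style B
    {"count": 3, "y": 780.0, "h": 30.0, "w": 300.0, "line_y": 0.0},    # noise
]
patterns = confirm_patterns(candidates)
```

In this sample, two header styles each account for two of the five candidates and are kept, while the single outlier falls below the assumed 30% ratio and is dropped.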
Claims (6)
1. A method for completely identifying the headers and footers of a PDF file, characterized by comprising the following steps:
(1) for a PDF file in a normal (not pure-image, i.e. not scanned) format, acquiring the original data using the open-source software pdf.js;
(2) judging whether a header and footer exist;
(3) finding the vertical range of the header and footer in the page;
(4) correcting search result 1;
(5) for a PDF file whose result deviates significantly or which is in pure-image format, acquiring related information such as line segments and text in the page using image recognition;
(6) searching for the header and footer and their features in the page;
(7) correcting search result 2.
2. The method of claim 1, wherein the step of judging whether a header and footer exist comprises:
(1) assigning an id to each extracted text item according to its order in the file;
(2) sorting the text items in ascending order of vertical position value to obtain the minimum vertical position value of the text in the page;
(3) comparing the vertical position value of the text item with the largest id against that minimum, and judging whether the two text items are on the same line by comparing the difference with a threshold;
(4) if the two text items are not on the same line, the current page has a header and footer; otherwise it does not.
3. The method of claim 1, wherein the step of finding the vertical range of the header and footer in the page comprises:
(1) based on the result of claim 2, if the current page has a header and footer, finding the other text items on the same line as the text item with the minimum vertical position value in the page, and obtaining the maximum id value among them;
(2) treating all text items whose id values are smaller than the id value found as the header and footer of the current page, judging whether each is a header or a footer according to its vertical position value, and obtaining the minimum vertical position value of the header and the maximum vertical position value of the footer.
4. The method of claim 1, wherein the criteria for correcting search result 1 are:
(1) the area between the header and the footer must be more than half of the page;
(2) if the minimum y value of the header minus the maximum y value of the footer is less than half of the page, the header and footer data just acquired is discarded.
5. The method of claim 1, wherein the step of searching for the header and footer and their features in the page comprises:
locating the header and footer according to the following specified features:
(1) if there is a horizontal line whose position is relatively fixed, it is judged to be the boundary of the header or footer:
in the upper half of the page, the part above the line is the header, and in the lower half, the part below the line is the footer;
(2) if there is no horizontal line, the first three lines of text in the upper half and the last five lines of text in the lower half of each page are examined:
(i) if a text block with similar characteristics (centered, left-aligned, right-aligned, and so on) appears at the same position on every page (in the upper third or the lower third) and occupies a similar width and height, it is a header or footer;
(ii) digit-feature recognition is performed on the string content of the text blocks, and if the content matches the pattern of consecutive numbers, the block is a header or footer containing a page number.
6. The method of claim 1, wherein the step of correcting search result 2 comprises:
classifying the header and footer styles in the PDF file, as follows:
(1) using a threshold, obtaining the relevant data of the preliminarily identified header and footer text blocks, including: count, position, height, width, and horizontal-line position;
(2) classifying each data item according to a difference tolerance, so that each class of each data item has a corresponding id;
(3) concatenating the ids into digit strings and de-duplicating them to obtain the header and footer types and the number of occurrences of each type;
(4) confirming the header and footer according to the number of occurrences and their proportion.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910587311.5A CN110543810A (en) | 2019-06-28 | 2019-06-28 | Technology for completely identifying header and footer of PDF (Portable document Format) file |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110543810A true CN110543810A (en) | 2019-12-06 |
Family
ID=68709989
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910587311.5A Pending CN110543810A (en) | 2019-06-28 | 2019-06-28 | Technology for completely identifying header and footer of PDF (Portable document Format) file |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110543810A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7607081B1 (en) * | 2002-06-28 | 2009-10-20 | Microsoft Corporation | Storing document header and footer information in a markup language document |
US20060156226A1 (en) * | 2005-01-10 | 2006-07-13 | Xerox Corporation | Method and apparatus for detecting pagination constructs including a header and a footer in legacy documents |
CN104951429A (en) * | 2014-03-26 | 2015-09-30 | 阿里巴巴集团控股有限公司 | Recognition method and device for page headers and page footers of format electronic document |
JP2017195499A (en) * | 2016-04-20 | 2017-10-26 | 富士ゼロックス株式会社 | Image storage apparatus and image storage program |
Non-Patent Citations (2)
Title |
---|
LIN X F: "Header and Footer Extraction by Page-Association", 《SPIE CONFERENCE ON DOCUMENT RECOGNITION AND RETRIEVAL》 * |
刘高军,刘妍妍,付晓玲: "基于分割线和区域特征的页眉页脚判别方法", 《北方工业大学学报》 * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111209865A (en) * | 2020-01-06 | 2020-05-29 | 中科鼎富(北京)科技发展有限公司 | File content extraction method and device, electronic equipment and storage medium |
CN112036132A (en) * | 2020-09-01 | 2020-12-04 | 珠海豹趣科技有限公司 | Document header and footer editing method and device and electronic equipment |
CN112036132B (en) * | 2020-09-01 | 2024-04-19 | 珠海豹趣科技有限公司 | Method and device for editing header and footer of document and electronic equipment |
CN112329426A (en) * | 2020-11-12 | 2021-02-05 | 北京方正印捷数码技术有限公司 | Header and footer identification method, apparatus, device and medium for electronic file |
CN112329426B (en) * | 2020-11-12 | 2024-05-28 | 北京方正印捷数码技术有限公司 | Method, device, equipment and medium for recognizing header and footer of electronic file |
CN112861821A (en) * | 2021-04-06 | 2021-05-28 | 刘羽 | Map data reduction method based on PDF file analysis |
CN112861821B (en) * | 2021-04-06 | 2024-04-19 | 刘羽 | Map data reduction method based on PDF file analysis |
CN113989314A (en) * | 2021-10-26 | 2022-01-28 | 深圳前海环融联易信息科技服务有限公司 | Method for removing header and footer based on Hough transform linear detection |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110543810A (en) | Technology for completely identifying header and footer of PDF (Portable document Format) file | |
EP0854433B1 (en) | Caption and photo extraction from scanned document images | |
EP1052593B1 (en) | Form search apparatus and method | |
US8462394B2 (en) | Document type classification for scanned bitmaps | |
US8401303B2 (en) | Method and apparatus for identifying character areas in a document image | |
JP2001167131A (en) | Automatic classifying method for document using document signature | |
US6917708B2 (en) | Handwriting recognition by word separation into silhouette bar codes and other feature extraction | |
Antonacopoulos et al. | A robust braille recognition system | |
CN113221711A (en) | Information extraction method and device | |
JP2007172132A (en) | Layout analysis program, layout analysis device and layout analysis method | |
CN102782702A (en) | Paragraph recognition in an optical character recognition (OCR) process | |
CN115994230A (en) | Intelligent archive construction method integrating artificial intelligence and knowledge graph technology | |
JP3485020B2 (en) | Character recognition method and apparatus, and storage medium | |
CN108921160B (en) | Book identification method, electronic equipment and storage medium | |
CN110287784B (en) | Annual report text structure identification method | |
CN108052955B (en) | High-precision Braille identification method and system | |
JP4077919B2 (en) | Image processing method and apparatus and storage medium therefor | |
JP2000285190A (en) | Method and device for identifying slip and storage medium | |
JP3608965B2 (en) | Automatic authoring device and recording medium | |
CN111832497B (en) | Text detection post-processing method based on geometric features | |
US20010043742A1 (en) | Communication document detector | |
CN110728240A (en) | Method and device for automatically identifying title of electronic file | |
CN115543915A (en) | Automatic database building method and system for personnel file directory | |
CN112560849B (en) | Neural network algorithm-based grammar segmentation method and system | |
Basu et al. | Segmentation of offline handwritten Bengali script |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20191206 |