CN110543810A - Technology for completely identifying header and footer of PDF (Portable document Format) file - Google Patents


Info

Publication number
CN110543810A
Authority
CN
China
Prior art keywords
header
footer
page
value
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910587311.5A
Other languages
Chinese (zh)
Inventor
徐茂龙
杨鸿健
程晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Zhilu Information Technology Co ltd
Original Assignee
Nanjing Zhilu Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Zhilu Information Technology Co ltd filed Critical Nanjing Zhilu Information Technology Co ltd
Priority to CN201910587311.5A priority Critical patent/CN110543810A/en
Publication of CN110543810A publication Critical patent/CN110543810A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/413Classification of content, e.g. text, photographs or tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/416Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Document Processing Apparatus (AREA)

Abstract

A method for identifying the headers and footers of a PDF file comprises the following steps. The PDF is parsed to obtain its raw stored data, which is split page by page, and headers and footers are identified from the order in which data is stored within each PDF page. The method is characterized in that, when a page has no header or footer, the data of the PDF document is stored item by item from top to bottom and from left to right, whereas when a header and footer are present, the PDF document stores the header first, then the footer, and then the body data. The header and footer are therefore obtained from the order of the document data and the position of the bottommost line of data on the page, judged by the distance from the text data to the bottom of the page. For PDF files in pure-picture format, headers and footers are identified from their features: header and footer features are searched for at the top and bottom of each page, analyzed across multiple pages, and the various header and footer forms are classified.

Description

Technology for completely identifying the header and footer of a PDF (Portable Document Format) file
Technical field:
The invention relates to a method for separating header and footer data from PDF (Portable Document Format) files.
Background art:
1. At present, almost all educational papers and listed-company announcements are distributed as PDF files through channels such as academic websites and stock exchanges. The format is convenient for reading across devices, but extracting data from such documents is complex, like looking for a needle in a haystack, because the files contain no structured data;
2. Structured extraction from a PDF file requires cutting out the header and footer areas so that they do not pollute the main content of the original text;
3. For a PDF file in pure-picture format, image recognition (OCR) must be performed on the page content to obtain all ruled lines, text coordinate data and text content;
4. For a PDF file in a normal format, open source software such as pdf.js can be used to parse the file.
Summary of the invention:
The application provides a method and a device for identifying the headers and footers of a PDF document, mainly divided into two processing modes:
1. Processing of normal-format PDF files
(1) acquire the raw parsed data of the PDF file;
(2) judge whether data belongs to a header or footer according to the order of the parsed data and the distance between the data and the bottom of the page;
(3) header and footer data in the upper half of the page is treated as the header;
(4) header and footer data in the lower half of the page is treated as the footer;
(5) if the area between the header and the footer is too small, the header and footer are re-acquired using the feature-based algorithm described below.
2. Header and footer identification for pure-picture or abnormal-format files
(1) search for header and footer features at the top and bottom of each page;
(2) analyze the features across multiple pages;
(3) classify the various header and footer forms.
Description of the drawings:
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and should therefore not be considered as limiting its scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.
FIG. 1 is a flowchart of the complete header and footer identification technique for a PDF file according to the present invention.
Detailed description of the embodiments:
First, PDF files in a normal, non-pure-picture (non-scanned) format
1. The open source software pdf.js is used to obtain the raw data produced by parsing the PDF file, and the coordinate data of each text item is extracted in the order of the parsed data.
The coordinate data includes:
x: distance from the left edge of the page
y: distance from the bottom of the page
w: width of the text data
h: height of the text data
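As an illustration only, the coordinate record described above could be modeled as follows. The class name `TextItem`, the helper `number_items` and the tuple layout of the input are assumptions made for this sketch; they are not part of the patent text or of the pdf.js API.

```python
from dataclasses import dataclass


@dataclass
class TextItem:
    id: int     # position of the item in the parser's original output order
    text: str   # text content of the item
    x: float    # distance from the left edge of the page
    y: float    # distance from the bottom of the page
    w: float    # width of the text data
    h: float    # height of the text data


def number_items(raw_items):
    """Build TextItem records, assigning ids in the order the items were parsed.

    raw_items is assumed to be a list of (text, x, y, w, h) tuples.
    """
    return [TextItem(i, text, x, y, w, h)
            for i, (text, x, y, w, h) in enumerate(raw_items)]
```

Keeping the original order as an explicit `id` matters because the later steps rely on storage order, which is lost once the items are sorted by `y`.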
2. Judging whether a header and a footer exist:
(1) assign an id to each extracted text coordinate according to its order in the parsed data;
(2) sort the items by y value to obtain the minimum y value, y_min;
(3) compare last_y, the y value of the text item with the largest id (the last item on the page in the original order), with y_min:
difference = last_y - y_min
(4) if the difference is below a certain threshold, the two items are on the same line and the page has no header or footer;
(5) if the difference is above the threshold, the two items are on different lines and the page has a header and footer;
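The existence test in step 2 can be sketched as below. This is a minimal sketch, not the patent's implementation: the function name, the `(id, y)` tuple input and the default tolerance `same_line_tol` are assumptions, since the source only speaks of "a certain threshold".

```python
def has_header_footer(items, same_line_tol=2.0):
    """Return True if the page appears to contain a header and footer.

    items: list of (id, y) tuples, where id is the storage order and
    y is the distance from the bottom of the page.
    same_line_tol: assumed tolerance for treating two y values as one line.
    """
    if not items:
        return False
    y_min = min(y for _, y in items)                    # bottommost text on the page
    _, last_y = max(items, key=lambda item: item[0])    # item stored last
    difference = last_y - y_min
    # If the last stored item is not on the bottommost line, something
    # lower on the page was stored earlier, which the method treats as
    # evidence of a header and footer.
    return difference > same_line_tol
```

For a page stored strictly top to bottom, the last stored item is also the lowest one, so the difference is zero and the function returns `False`.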
3. If a header and footer exist, find their areas (y values):
(1) find the other text items on the same line as the text with the minimum y value;
(2) find the maximum id in that line, row_id_max; every text item on the page whose id is less than or equal to row_id_max belongs to the header or footer;
(3) of these items, those in the upper half of the page belong to the header and those in the lower half belong to the footer;
(4) find the minimum y value of the header and the maximum y value of the footer;
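Step 3 can be illustrated with the following sketch. The names `TextItem`, `split_header_footer`, `page_height` and `same_line_tol` are hypothetical, and the tolerance used to decide that two items share a line is an assumed value.

```python
from typing import NamedTuple


class TextItem(NamedTuple):
    id: int    # storage order within the page
    y: float   # distance from the bottom of the page


def split_header_footer(items, page_height, same_line_tol=2.0):
    """Separate header and footer items from a page known to contain them."""
    y_min = min(item.y for item in items)
    # (1) all items on the bottommost line
    bottom_row = [item for item in items if abs(item.y - y_min) <= same_line_tol]
    # (2) largest id on that line; everything stored up to it is header or footer
    row_id_max = max(item.id for item in bottom_row)
    candidates = [item for item in items if item.id <= row_id_max]
    # (3) upper half of the page is header, lower half is footer
    headers = [item for item in candidates if item.y >= page_height / 2]
    footers = [item for item in candidates if item.y < page_height / 2]
    # (4) boundaries of the body region
    header_y_min = min((item.y for item in headers), default=None)
    footer_y_max = max((item.y for item in footers), default=None)
    return headers, footers, header_y_min, footer_y_max
```

The two returned boundary values delimit the body region and feed directly into the correction step that follows.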
4. Correction:
The area between the header and the footer must be more than half of the page.
If the minimum y value of the header minus the maximum y value of the footer is less than half of the page height, the header and footer data just acquired is discarded, and the header and footer are then acquired using the feature identification method below.
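The correction rule amounts to a single comparison, sketched here. The function name is hypothetical, and two details are assumptions not stated in the source: a missing header or footer boundary falls back to the page edge, and a body region of exactly half the page is kept.

```python
def passes_correction(header_y_min, footer_y_max, page_height):
    """Return True if the detected header and footer should be kept.

    header_y_min: lowest y of the header, or None if no header was found.
    footer_y_max: highest y of the footer, or None if no footer was found.
    """
    # Assumption: with no header the body extends to the top of the page,
    # and with no footer it extends to the bottom.
    top = page_height if header_y_min is None else header_y_min
    bottom = 0.0 if footer_y_max is None else footer_y_max
    # Discard the result when the body region is less than half the page.
    return (top - bottom) >= page_height / 2
```

When this returns `False`, the caller would discard the result and fall back to the feature-based identification described in the second part.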
Second, PDF files in pure-picture (scanned) format, or files whose header and footer cannot be obtained by the first method
1. If the file is a pure picture, the coordinates of lines and of text blocks are obtained through image recognition.
2. Finding the features of a header and a footer:
(1) If the proportion of pages on which the following features occur in the upper third or the lower third of the page reaches a certain threshold, the corresponding content is determined to be a header or footer.
(2) Features
a. If there is a horizontal line and its position is relatively fixed, it is taken as the boundary of the header or footer: the part above the line in the upper half is the header, and the part below the line in the lower half is the footer.
b. If there is no horizontal line, the first three lines of text in the upper half and the last five lines of text in the lower half of each page are examined:
i. if a text block with similar characteristics (such as centered, left-aligned or right-aligned) appears at the same position (the upper third or the lower third) of each page, and the width and height it occupies are similar, it is a header or footer;
ii. digit feature recognition is performed on the string content of the text blocks; if the content shows the characteristics of consecutive numbers, the block is a header or footer containing a page number.
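Two of the features above lend themselves to a short sketch: the "relatively fixed" horizontal line and the consecutive page numbers. Both functions, their names and the tolerance `tol` are illustrative assumptions; the patent does not specify how these checks are computed.

```python
import re
from statistics import median


def fixed_line_ratio(line_ys, tol=3.0):
    """Fraction of pages whose horizontal line sits near the typical position.

    line_ys: one entry per page, the y of the detected horizontal line
    or None when the page has no such line.
    """
    found = [y for y in line_ys if y is not None]
    if not found:
        return 0.0
    typical = median(found)
    stable = sum(1 for y in found if abs(y - typical) <= tol)
    return stable / len(line_ys)


def looks_like_page_numbers(texts):
    """Check whether candidate strings from consecutive pages count up by one.

    Only the first run of digits in each string is considered.
    """
    numbers = []
    for text in texts:
        match = re.search(r"\d+", text)
        if match is None:
            return False
        numbers.append(int(match.group()))
    return len(numbers) >= 2 and all(b - a == 1 for a, b in zip(numbers, numbers[1:]))
```

The ratio returned by `fixed_line_ratio` would then be compared with the page-proportion threshold from item (1); a repeated year such as "2018" is rejected by `looks_like_page_numbers` because it does not increase from page to page.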
(3) Correction
Different header and footer styles may exist within one PDF and can be classified.
The specific classification algorithm is as follows:
a. according to a certain threshold, collect the relevant data of the preliminarily obtained header and footer text blocks, including: number, position, height, width and horizontal line position;
b. classify each data item according to a certain difference value, with each class of each data item having a corresponding id;
c. generate a digit string from the ids and deduplicate the strings to obtain the header and footer types and the number of occurrences of each type;
d. confirm the header and footer according to the number of occurrences and their proportion.
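The classification steps a to d could look roughly like the sketch below. Everything here is an assumption made for illustration: the per-page dictionary keys, the bucket width `step` standing in for "a certain difference value", the `-` separator in the signature string and the `min_ratio` cutoff.

```python
from collections import Counter


def classify_pages(pages, step=5.0):
    """Count how often each header/footer layout occurs across pages.

    pages: list of dicts with keys "count", "y", "h", "w", "line_y"
    describing the candidate block found on each page.
    """
    def bucket(value):
        # Values within the same step-wide interval share a class id.
        return -1 if value is None else int(value // step)

    signatures = []
    for page in pages:
        ids = (page["count"], bucket(page["y"]), bucket(page["h"]),
               bucket(page["w"]), bucket(page["line_y"]))
        # Join the class ids into one string that identifies the layout type.
        signatures.append("-".join(str(i) for i in ids))
    # Counting identical strings both deduplicates them and records occurrences.
    return Counter(signatures)


def confirm(type_counts, min_ratio=0.3):
    """Keep the layout types that occur on at least min_ratio of the pages."""
    total = sum(type_counts.values())
    return [sig for sig, n in type_counts.items() if n / total >= min_ratio]
```

Fixed-width bucketing is a simple way to make nearly equal measurements compare equal, at the cost of occasionally splitting two close values that fall on either side of a bucket boundary.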

Claims (6)

1. A method for completely identifying the headers and footers of a PDF file, characterized by comprising the following steps:
(1) for a PDF file in a normal, non-pure-picture (non-scanned) format, acquiring the raw data using the open source software pdf.js;
(2) judging whether a header and a footer exist;
(3) finding the vertical range of the header and footer within the page;
(4) correcting search result 1;
(5) for a PDF file whose result deviates significantly or which is in pure-picture format, acquiring relevant information such as line segments and text in the page using image recognition;
(6) searching for the header and footer and their features in the page;
(7) correcting search result 2.
2. The method of claim 1, wherein the step of judging whether a header and footer exist comprises:
(1) assigning an id to each extracted text item according to its order in the file;
(2) sorting the text items in ascending order of their vertical position values to obtain the minimum vertical position value of the text in the page;
(3) comparing the vertical position value of the text with the largest id against that minimum, and judging whether the two pieces of text are on the same line by comparing the difference between the two values with a threshold;
(4) if the two pieces of text are not on the same line, a header and footer exist in the current page; otherwise they do not.
3. The method of claim 1, wherein the step of finding the vertical range of the header and footer within a page comprises:
(1) based on the result of claim 2, if the current page has a header and footer, finding the other text on the same line as the text with the minimum vertical position value in the page, and obtaining the maximum id value among those texts;
(2) all text whose id value is smaller than the id value found belongs to the header or footer of the current page; judging whether each such text is header or footer according to its vertical position value, and at the same time obtaining the minimum vertical position value of the header and the maximum vertical position value of the footer.
4. The method according to claim 1, wherein the criteria for correcting search result 1 are:
(1) the area between the header and the footer must be more than half of the page;
(2) if the minimum y value of the header minus the maximum y value of the footer is less than half of the page, the header and footer data just acquired is discarded.
5. The method of claim 1, wherein the step of searching for the header and footer and their features in the page comprises:
locating the header and footer according to the following specified features:
(1) if there is a horizontal line and its position is relatively fixed, it is judged to be the boundary of the header or footer:
the part above the line in the upper half is the header, and the part below the line in the lower half is the footer;
(2) if there is no horizontal line, the first three lines of text in the upper half and the last five lines of text in the lower half of each page are examined:
(i) if a text block with similar characteristics (such as centered, left-aligned or right-aligned) appears at the same position (the upper third or the lower third) of each page, and the width and height it occupies are similar, it is a header or footer;
(ii) digit feature recognition is performed on the string content of the text blocks; if the content shows the characteristics of consecutive numbers, the block is a header or footer containing a page number.
6. The method of claim 1, wherein the step of correcting search result 2 comprises:
classifying the header and footer styles in the PDF file, as follows:
(1) according to a certain threshold, collecting the relevant data of the preliminarily obtained header and footer text blocks, including: number, position, height, width and horizontal line position;
classifying each data item according to a certain difference value, with each class of each data item having a corresponding id;
(2) generating a digit string from the ids and deduplicating the strings to obtain the header and footer types and the number of occurrences of each type;
(3) confirming the header and footer according to the number of occurrences and their proportion.
CN201910587311.5A 2019-06-28 2019-06-28 Technology for completely identifying header and footer of PDF (Portable document Format) file Pending CN110543810A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910587311.5A CN110543810A (en) 2019-06-28 2019-06-28 Technology for completely identifying header and footer of PDF (Portable document Format) file

Publications (1)

Publication Number Publication Date
CN110543810A true CN110543810A (en) 2019-12-06

Family

ID=68709989

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910587311.5A Pending CN110543810A (en) 2019-06-28 2019-06-28 Technology for completely identifying header and footer of PDF (Portable document Format) file

Country Status (1)

Country Link
CN (1) CN110543810A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111209865A (en) * 2020-01-06 2020-05-29 中科鼎富(北京)科技发展有限公司 File content extraction method and device, electronic equipment and storage medium
CN112036132A (en) * 2020-09-01 2020-12-04 珠海豹趣科技有限公司 Document header and footer editing method and device and electronic equipment
CN112329426A (en) * 2020-11-12 2021-02-05 北京方正印捷数码技术有限公司 Header and footer identification method, apparatus, device and medium for electronic file
CN112861821A (en) * 2021-04-06 2021-05-28 刘羽 Map data reduction method based on PDF file analysis
CN113989314A (en) * 2021-10-26 2022-01-28 深圳前海环融联易信息科技服务有限公司 Method for removing header and footer based on Hough transform linear detection

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060156226A1 (en) * 2005-01-10 2006-07-13 Xerox Corporation Method and apparatus for detecting pagination constructs including a header and a footer in legacy documents
US7607081B1 (en) * 2002-06-28 2009-10-20 Microsoft Corporation Storing document header and footer information in a markup language document
CN104951429A (en) * 2014-03-26 2015-09-30 阿里巴巴集团控股有限公司 Recognition method and device for page headers and page footers of format electronic document
JP2017195499A (en) * 2016-04-20 2017-10-26 富士ゼロックス株式会社 Image storage apparatus and image storage program

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIN X F: "Header and Footer Extraction by Page-Association", 《SPIE CONFERENCE ON DOCUMENT RECOGNITION AND RETRIEVAL》 *
刘高军,刘妍妍,付晓玲: "基于分割线和区域特征的页眉页脚判别方法", 《北方工业大学学报》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111209865A (en) * 2020-01-06 2020-05-29 中科鼎富(北京)科技发展有限公司 File content extraction method and device, electronic equipment and storage medium
CN112036132A (en) * 2020-09-01 2020-12-04 珠海豹趣科技有限公司 Document header and footer editing method and device and electronic equipment
CN112036132B (en) * 2020-09-01 2024-04-19 珠海豹趣科技有限公司 Method and device for editing header and footer of document and electronic equipment
CN112329426A (en) * 2020-11-12 2021-02-05 北京方正印捷数码技术有限公司 Header and footer identification method, apparatus, device and medium for electronic file
CN112329426B (en) * 2020-11-12 2024-05-28 北京方正印捷数码技术有限公司 Method, device, equipment and medium for recognizing header and footer of electronic file
CN112861821A (en) * 2021-04-06 2021-05-28 刘羽 Map data reduction method based on PDF file analysis
CN112861821B (en) * 2021-04-06 2024-04-19 刘羽 Map data reduction method based on PDF file analysis
CN113989314A (en) * 2021-10-26 2022-01-28 深圳前海环融联易信息科技服务有限公司 Method for removing header and footer based on Hough transform linear detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20191206
