CN110543810A

CN110543810A - Technology for completely identifying headers and footers of PDF (Portable Document Format) files

Info

Publication number: CN110543810A
Application number: CN201910587311.5A
Authority: CN
Inventors: 徐茂龙; 杨鸿健; 程晨
Original assignee: Nanjing Zhilu Information Technology Co ltd
Current assignee: Nanjing Zhilu Information Technology Co ltd
Priority date: 2019-06-28
Filing date: 2019-06-28
Publication date: 2019-12-06

Abstract

A method for identifying the headers and footers of a PDF file comprises the following steps. The PDF is parsed to obtain its raw stored data, which is split page by page. For normal-format files, headers and footers are identified from the order in which data is stored in each page: when a page has no header or footer, its data is stored item by item from top to bottom and from left to right, but when a header and footer exist, the PDF stores the header first, then the footer, and then the body data. The header and footer are therefore obtained from the document data order together with the position of the bottommost line of data on the page, judged by the distance from the text data to the bottom of the page. For PDF files in pure-image format, headers and footers are identified from their characteristic features: header and footer features are searched for at the top and bottom of each page, analyzed across multiple pages, and the various header and footer forms are classified.

Description

Technology for completely identifying headers and footers of PDF (Portable Document Format) files

Technical field:

The invention relates to a method for separating the header and footer data of PDF (Portable Document Format) files.

Background art:

1. At present, almost all academic papers and listed-company announcements are distributed in PDF format through channels such as academic websites and stock exchanges. The format is convenient for reading across devices, but for data acquisition from documents, extraction is complex, like finding a needle in a haystack, because the files contain no structured data;

2. Structured extraction from a PDF file requires cutting out the header and footer regions so that they do not contaminate the main body content;

3. For a PDF file in pure-image format, image recognition (OCR) must be performed on the page content to obtain all ruled lines, text coordinate data, and text content;

4. For a PDF file in normal format, open-source software such as pdf.js can be used to parse the file and obtain its raw data.

Summary of the invention:

The application provides a method and a device for identifying the headers and footers of a PDF document, with two main processing modes:

1. Processing of normal-format PDF files

(1) obtaining the raw parsed data of the PDF file;

(2) determining whether data belongs to a header or footer according to the order of the parsed data and the distance between the data and the bottom of the page;

(3) treating header and footer data in the upper half of the page as the header;

(4) treating header and footer data in the lower half of the page as the footer;

(5) if the area between the header and the footer is too small, re-acquiring the header and footer using the algorithm described below.

2. Header and footer identification for pure-image or abnormal-format files

(1) searching for header and footer features at the top and bottom of each page;

(2) analyzing the features found across multiple pages;

(3) classifying the various header and footer forms.

Description of the drawings:

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and should therefore not be considered as limiting its scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.

FIG. 1 is a flowchart illustrating a complete header and footer identification technique for a PDF file according to the present invention.

Detailed description of the embodiments:

First, PDF files in normal format (not pure-image, i.e. not scanned)

1. The pdf.js open-source software is used to obtain the raw data produced by parsing the PDF file, and the coordinate data of each text item is extracted in the order of the parsed data.

The coordinate data includes:

x: distance from the left edge of the page

y: distance from the bottom of the page

w: width of the text item

h: height of the text item

2. Determining whether a header and footer exist:

(1) assign each extracted text coordinate an id according to its order in the parsed data;

(2) sort by y value to obtain the minimum y value, y_min;

(3) take the y value of the text item with the largest id (the last item of the page in the original order), last_y, and compare it with y_min:

difference = last_y - y_min

(4) if the difference is below a certain threshold, the two items are on the same line, and no header or footer exists;

(5) if the difference is above that threshold, the two items are not on the same line, and a header and footer exist.
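The check in step 2 can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the `TextItem` class, the function name, and the default `same_line_tol` of 2.0 are assumptions, since the text only speaks of "a certain threshold".

```python
from dataclasses import dataclass

@dataclass
class TextItem:
    id: int    # position in the parsed (storage) order
    x: float   # distance from the left edge of the page
    y: float   # distance from the bottom of the page
    w: float   # width of the text item
    h: float   # height of the text item

def has_header_footer(items, same_line_tol=2.0):
    """Return True when the last-stored item is not on the page's bottom line.

    same_line_tol is a hypothetical threshold; the source does not give a value.
    """
    if not items:
        return False
    y_min = min(it.y for it in items)               # bottommost line on the page
    last_y = max(items, key=lambda it: it.id).y     # item stored last
    return (last_y - y_min) > same_line_tol

# Body stored last but footer sits lower on the page: header/footer present.
page = [
    TextItem(0, 72, 800, 200, 10),  # header, stored first
    TextItem(1, 280, 30, 20, 10),   # footer, stored second
    TextItem(2, 72, 700, 450, 12),  # body
    TextItem(3, 72, 100, 450, 12),  # last body line
]
print(has_header_footer(page))
```

When a page has no header or footer, the last-stored item is itself the bottommost line, so the difference is close to zero and the function returns False.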

3. If a header and footer exist, finding the header and footer regions (y values):

(1) find the other text items on the same line as the text item with the minimum y value;

(2) find the maximum id in that text line, row_id_max; every text item in the page whose id is less than or equal to row_id_max belongs to the header or footer;

(3) of these, items in the upper half of the page belong to the header, and items in the lower half of the page belong to the footer;

(4) find the minimum y value of the header and the maximum y value of the footer.
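Step 3 can be sketched as below. The names `TextItem`, `find_header_footer`, and `same_line_tol` are illustrative assumptions, and the page midpoint is taken as `page_height / 2`, which the text implies but does not state as a formula.

```python
from typing import NamedTuple

class TextItem(NamedTuple):
    id: int    # position in the parsed (storage) order
    y: float   # distance from the bottom of the page

def find_header_footer(items, page_height, same_line_tol=2.0):
    """Return (header_y_min, footer_y_max) for one page; None where a part is absent.

    same_line_tol is a hypothetical tolerance for "on the same line".
    """
    y_min = min(it.y for it in items)
    # (1) every item on the bottommost line
    bottom_line = [it for it in items if abs(it.y - y_min) <= same_line_tol]
    # (2) largest id on that line; everything stored up to it is header or footer
    row_id_max = max(it.id for it in bottom_line)
    marginal = [it for it in items if it.id <= row_id_max]
    # (3) upper half is header, lower half is footer
    header = [it for it in marginal if it.y >= page_height / 2]
    footer = [it for it in marginal if it.y < page_height / 2]
    # (4) inner boundaries of the two regions
    header_y_min = min((it.y for it in header), default=None)
    footer_y_max = max((it.y for it in footer), default=None)
    return header_y_min, footer_y_max

items = [TextItem(0, 800), TextItem(1, 30), TextItem(2, 30),
         TextItem(3, 700), TextItem(4, 100)]
print(find_header_footer(items, page_height=842))
```

The two returned values bound the body region: text with y between footer_y_max and header_y_min is treated as body content.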

4. Correction:

The area between the header and the footer must occupy more than half of the page.

If the minimum y value of the header minus the maximum y value of the footer is less than half of the page, the header and footer data just acquired is discarded, and the header and footer are then acquired using the feature-recognition approach described below.
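As a worked illustration of this correction rule, the sketch below assumes "half of the page" means half of the page height in the same units as y; the function name is hypothetical.

```python
def should_discard(header_y_min, footer_y_max, page_height):
    """Return True when the gap between header and footer is under half the page."""
    gap = header_y_min - footer_y_max
    return gap < page_height / 2

# On an 842-unit-high page: header bottom at 800, footer top at 30 -> gap 770, kept.
print(should_discard(800, 30, 842))
# Header bottom at 500, footer top at 300 -> gap 200, discarded.
print(should_discard(500, 300, 842))
```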

Second, PDF files in pure-image (scanned) format, or files whose header and footer cannot be obtained by the first method

1. If the file is a pure image, the coordinates of lines and of text blocks are obtained through image recognition.

2. Finding the features of the header and footer:

(1) If the proportion of pages on which the following features occur in the upper third or the lower third of the page reaches a certain threshold, the matching content is determined to be a header or footer.

(2) Features

a. if there is a horizontal line whose position is relatively fixed, it is determined to be the boundary of the header or footer: in the upper part of the page, the content above the line is the header; in the lower part of the page, the content below the line is the footer;

b. if there is no horizontal line, the first three lines of text in the upper half and the last five lines of text in the lower half of each page are examined:

i. if a text block with similar features (such as centered, left-aligned, or right-aligned) appears at the same position (in the upper third or the lower third) of each page, and the width and height it occupies are similar, the block is a header or footer;

ii. numeric-feature recognition is performed on the string content of the text blocks; if the content matches the pattern of consecutive numbers, the block is a header or footer containing page numbers.
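Feature ii, combined with the proportion threshold from (1), might be sketched as follows. It assumes the candidate strings come from the same block position on consecutive pages and that the first run of digits in each string is the page number; `min_ratio` and the function name are illustrative, not values from the source.

```python
import re

def looks_like_page_numbers(strings, min_ratio=0.8):
    """Return True if enough consecutive pages carry numbers increasing by one.

    strings: text of the candidate block on consecutive pages.
    min_ratio: hypothetical stand-in for the source's "certain threshold".
    """
    numbers = []
    for s in strings:
        m = re.search(r"\d+", s)            # first run of ASCII/Unicode digits
        numbers.append(int(m.group()) if m else None)
    pairs = list(zip(numbers, numbers[1:]))  # neighbouring pages
    if not pairs:
        return False
    hits = sum(1 for a, b in pairs
               if a is not None and b is not None and b == a + 1)
    return hits / len(pairs) >= min_ratio

print(looks_like_page_numbers(["- 1 -", "- 2 -", "- 3 -", "- 4 -"]))
print(looks_like_page_numbers(["Annual Report 2018"] * 3))
```

A repeated title that happens to contain a number (such as a year) does not increase from page to page, so it fails this test and would have to be caught by feature i instead.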

(3) Correction

Different header and footer styles may exist within one PDF, and they can be classified.

The specific classification algorithm is as follows:

a. according to a certain threshold, obtain the relevant data of the preliminarily obtained header and footer text blocks, including: number, position, height, width, and horizontal line position;

b. classify each data item according to a certain difference value, with each class of each data item having a corresponding id;

c. concatenate the ids into numeric strings and deduplicate them to obtain the header and footer types and the number of occurrences of each type;

d. confirm the header and footer according to the number of occurrences and their proportion.
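One possible reading of steps a to d is sketched below. The field names, the single tolerance `tol` shared by all fields, and the `min_share` cut-off are assumptions made for illustration; the source specifies only "a certain difference value" and "the occurrence times and the proportion".

```python
from collections import Counter

# Hypothetical per-page measurements of the candidate header/footer block.
FIELDS = ("count", "y", "height", "width", "line_y")

def class_id(value, representatives, tol):
    """Return the id of the first class within tol of value, creating one if needed."""
    for i, rep in enumerate(representatives):
        if abs(value - rep) <= tol:
            return i
    representatives.append(value)
    return len(representatives) - 1

def classify_layouts(pages, tol=3.0, min_share=0.2):
    """Map each layout signature to its page count, keeping only frequent ones."""
    reps = {f: [] for f in FIELDS}           # class representatives per field
    signatures = []
    for page in pages:
        ids = [class_id(page[f], reps[f], tol) for f in FIELDS]   # step b
        signatures.append("-".join(str(i) for i in ids))           # step c
    counts = Counter(signatures)                                   # dedupe + count
    return {sig: n for sig, n in counts.items()
            if n / len(pages) >= min_share}                        # step d

a = {"count": 1, "y": 800.0, "height": 10.0, "width": 200.0, "line_y": 790.0}
a2 = dict(a, y=801.0, width=199.0)   # same style with small jitter
b = {"count": 1, "y": 30.0, "height": 10.0, "width": 40.0, "line_y": 45.0}
print(classify_layouts([a, a2, a, a2, b], min_share=0.5))
```

Joining the ids with a separator keeps signatures unambiguous once any field has ten or more classes. A real implementation would likely use a separate tolerance per field, since a tolerance suited to coordinates is too coarse for the block count.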

Claims

1. A method for completely identifying the headers and footers of a PDF file, characterized by comprising the following steps:

(1) for a PDF file in normal format (not pure-image, i.e. not scanned), acquiring raw data using the pdf.js open-source software;

(2) determining whether a header and footer exist;

(3) finding the vertical range values of the header and footer in the page;

(4) correcting search result 1;

(5) for a PDF file with a large deviation in the result, or in pure-image format, acquiring relevant information in the page, such as line segments and text, using image recognition;

(6) searching for the header and footer in the page and their features;

(7) correcting search result 2.

2. The method of claim 1, wherein the step of determining whether a header and footer exist comprises:

(1) assigning an id to each extracted text according to its order in the file;

(2) sorting the texts in ascending order of their vertical position values to obtain the minimum vertical position value of the texts in the page;

(3) comparing the vertical position value of the text with the largest id against that minimum, and determining whether the two pieces of text are on the same line by comparing the difference between the two values with a threshold;

(4) if the two pieces of text are not on the same line, a header and footer exist in the current page; otherwise, no header or footer exists.

3. The method of claim 1, wherein the step of finding the vertical range values of the header and footer in the page comprises:

(1) based on the result of claim 2, if the current page has a header and footer, finding the other texts on the same line as the text with the minimum vertical position value in the page, and obtaining the maximum id value among these texts;

(2) treating all texts whose id values are smaller than the id value found as the header and footer of the current page, determining whether each is a header or a footer according to its vertical position value, and obtaining the minimum vertical position value of the header and the maximum vertical position value of the footer.

4. The method according to claim 1, wherein the criteria for correcting search result 1 are:

(1) the area between the header and the footer must occupy more than half of the page;

(2) if the minimum y value of the header minus the maximum y value of the footer is less than half of the page, the header and footer data just acquired is discarded.

5. The method of claim 1, wherein the step of searching for the header and footer in the page and their features comprises:

locating the header and footer according to the following specified features:

(1) if there is a horizontal line whose position is relatively fixed, it is determined to be the boundary of the header or footer,

the content above the line in the upper half being the header, and the content below the line in the lower half being the footer;

(2) if there is no horizontal line, the first three lines of text in the upper half and the last five lines of text in the lower half of each page are examined:

(i) if a text block with similar features (such as centered, left-aligned, or right-aligned) appears at the same position (in the upper third or the lower third) of each page, and the width and height it occupies are similar, the block is a header or footer;

(ii) numeric-feature recognition is performed on the string content of the text blocks, and if the content matches the pattern of consecutive numbers, the block is a header or footer containing page numbers.

6. The method of claim 1, wherein the step of correcting search result 2 comprises:

classifying the header and footer styles in the PDF file, which comprises the following steps:

(1) obtaining, according to a certain threshold, the relevant data of the preliminarily obtained header and footer text blocks, including: number, position, height, width, and horizontal line position;

classifying each data item according to a certain difference value, with each class of each data item having a corresponding id;

(2) concatenating the ids into numeric strings and deduplicating them to obtain the header and footer types and the number of occurrences of each type;

(3) confirming the header and footer according to the number of occurrences and their proportion.