CN111444750A - PDF document identification method and device and electronic equipment - Google Patents

Publication number
CN111444750A
Authority
CN
China
Prior art keywords
pages
picture
preset number
pdf document
preset
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910051078.9A
Other languages
Chinese (zh)
Other versions
CN111444750B (en
Inventor
宁廷泽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Office Software Inc
Zhuhai Kingsoft Office Software Co Ltd
Guangzhou Kingsoft Mobile Technology Co Ltd
Original Assignee
Beijing Kingsoft Office Software Inc
Zhuhai Kingsoft Office Software Co Ltd
Guangzhou Kingsoft Mobile Technology Co Ltd
Application filed by Beijing Kingsoft Office Software Inc, Zhuhai Kingsoft Office Software Co Ltd, Guangzhou Kingsoft Mobile Technology Co Ltd
Priority to CN201910051078.9A
Publication of CN111444750A
Application granted
Publication of CN111444750B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/60 Analysis of geometric attributes
    • G06T7/62 Analysis of geometric attributes of area, perimeter, diameter or volume
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30176 Document
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Abstract

The embodiments of the application provide a PDF document identification method, a PDF document identification device, and electronic equipment. The method comprises: determining, according to preset area calculation algorithms, the character area corresponding to all character objects and the picture area corresponding to all picture objects contained in a preset number of pages; determining the picture area ratio and the character area ratio respectively; and, if the picture area ratio and the character area ratio meet preset conditions, determining the type of the target PDF document. The embodiments therefore analyze the picture objects and the character objects in the preset number of pages of the target PDF document together, and determine the type of the target PDF document based both on whether it contains character objects and on the proportion of those pages occupied by picture objects and by character objects, which improves the accuracy of PDF document identification.

Description

PDF document identification method and device and electronic equipment
Technical Field
The present application relates to the field of PDF document editing technologies, and in particular, to a PDF document identification method and apparatus, and an electronic device.
Background
PDF (Portable Document Format) is an electronic document format that offers users many advantages in practical applications. For example, a document in PDF format preserves, as far as possible, the appearance it had when it was originally edited, and avoids changes to fonts, layout and the like when the same document is stored on different terminal devices.
Today there are many ways to generate a PDF document. For example, a PDF editor may directly convert a file in a format such as Word or PPT into a PDF document. Alternatively, a terminal device may scan each page of a paper document to generate an image of that page and then splice the generated images into a PDF document; some electronic books, for example, are produced by scanning each page of a paper book, generating one picture per scanned page, and splicing the pictures into a PDF document. Different generation methods produce different types of PDF documents, for example converted PDF documents and scanned PDF documents. A converted PDF document is a PDF document generated by converting another file, such as a Word document; a scanned PDF document is a PDF document generated by scanning each page of a paper document. Identifying the type of a PDF document makes it possible to offer an appropriate algorithm for the user's further operations on it. For example, the characters in a picture-based scanned PDF document are part of the scanned pictures, so the user cannot copy or paste them; once the terminal device identifies a PDF document as a picture-based scanned PDF document, it can prompt the user to apply an OCR algorithm to the document so that the characters it contains can be copied and pasted.
In the prior art, a picture-based scanned PDF document is identified by analyzing whether the PDF document contains characters; if it does not, the PDF document is determined to be a picture-based scanned PDF document.
Disclosure of Invention
An embodiment of the application aims to provide a PDF document identification method, a PDF document identification device and electronic equipment, so as to improve the accuracy of PDF document identification. The specific technical solution is as follows:
In a first aspect, a PDF document identification method is provided, where the method includes:
acquiring a preset number of pages contained in a target PDF document;
analyzing the preset number of pages, and determining picture objects and character objects contained in the preset number of pages;
determining picture areas corresponding to all picture objects contained in the preset number of pages according to picture objects contained in the preset number of pages and a preset picture area calculation algorithm, and determining character areas corresponding to all character objects contained in the preset number of pages according to character objects contained in the preset number of pages and a preset character area calculation algorithm;
determining the ratio of the picture area to the total page area corresponding to the preset number of pages according to the total page area corresponding to the preset number of pages and the picture area corresponding to all picture objects contained in the preset number of pages, and taking the ratio as the picture area ratio;
determining the ratio of the text area to the total page area corresponding to the preset number of pages according to the total page area corresponding to the preset number of pages and the text area corresponding to all text objects contained in the preset number of pages, and taking the ratio as the text area ratio;
and if the picture area ratio and the character area ratio meet preset conditions, determining the type of the target PDF document.
Optionally, the step of determining, according to the picture objects included in the preset number of pages and a preset picture area calculation algorithm, picture areas corresponding to all the picture objects included in the preset number of pages may include:
determining the length and width corresponding to each picture object contained in the preset number of pages according to the acquired configuration file of the target PDF document;
multiplying the length and the width corresponding to each picture object contained in the preset number of pages to obtain the area corresponding to each picture object;
summing the areas corresponding to all the picture objects in the preset number of pages to obtain the picture areas corresponding to all the picture objects contained in the preset number of pages.
Optionally, the step of determining the text areas corresponding to all the text objects included in the preset number of pages according to the text objects included in the preset number of pages and a preset text area calculation algorithm may include:
determining the length and width corresponding to each character object contained in the preset number of pages according to the acquired configuration file of the target PDF document;
multiplying the length and the width corresponding to each character object contained in the preset number of pages to obtain the area corresponding to each character object;
and summing the areas corresponding to all the character objects contained in the preset number of pages to obtain the character areas corresponding to all the character objects contained in the preset number of pages.
Optionally, the step of determining the type of the target PDF document if the picture area ratio and the text area ratio satisfy a preset condition may include:
if the picture area ratio is larger than a first preset threshold value and the character area ratio is smaller than a second preset threshold value, determining that the target PDF document is of a scanned PDF document type; wherein the first preset threshold is greater than a second preset threshold.
Optionally, the types of the target PDF document include a scanned PDF document and a converted PDF document;
the method may further comprise:
if the type of the target PDF document is a scanned PDF document, displaying preset prompt information, wherein the preset prompt information is used for prompting a user to perform a preset operation on a picture object in the scanned PDF document.
In a second aspect, a PDF document identification device is provided, the device comprising:
the acquisition module is used for acquiring a preset number of pages contained in the target PDF document;
the object determining module is used for analyzing the preset number of pages and determining the picture objects and the character objects contained in the preset number of pages;
the area determination module is used for determining the picture areas corresponding to all the picture objects contained in the preset number of pages according to the picture objects contained in the preset number of pages and a preset picture area calculation algorithm, and determining the character areas corresponding to all the character objects contained in the preset number of pages according to the character objects contained in the preset number of pages and a preset character area calculation algorithm;
a picture area ratio determining module, configured to determine, according to a total page area corresponding to the preset number of pages and picture areas corresponding to all picture objects included in the preset number of pages, a ratio between the picture area and the total page area corresponding to the preset number of pages, and use the ratio as a picture area ratio;
a text area ratio determining module, configured to determine, according to a total page area corresponding to the preset number of pages and text areas corresponding to all text objects included in the preset number of pages, a ratio between the text area and the total page area corresponding to the preset number of pages, and use the ratio as a text area ratio;
and the document type determining module is used for determining the type of the target PDF document if the picture area ratio and the character area ratio meet preset conditions.
Optionally, the area determining module may include:
the length and width determining submodule corresponding to the picture objects is used for determining the length and width corresponding to each picture object contained in the preset number of pages according to the acquired configuration file of the target PDF document;
the area determining submodule corresponding to the picture objects is used for multiplying the length and the width corresponding to each picture object contained in the preset number of pages to obtain the area corresponding to each picture object;
and the picture area determining submodule corresponding to all the picture objects is used for summing the areas corresponding to all the picture objects in the preset number of pages to obtain the picture areas corresponding to all the picture objects contained in the preset number of pages.
Optionally, the area determining module may further include:
a length and width determining submodule corresponding to the text objects, configured to determine, according to the obtained configuration file of the target PDF document, a length and a width corresponding to each text object included in the preset number of pages;
the area determining submodule corresponding to the character objects is used for multiplying the length and the width corresponding to each character object contained in the preset number of pages to obtain the area corresponding to each character object;
and the character area determining submodule corresponding to all the character objects is used for summing the areas corresponding to all the character objects contained in the preset number of pages to obtain the character areas corresponding to all the character objects contained in the preset number of pages.
Optionally, the document type determining module may include:
a scanning PDF document type determining submodule, configured to determine that the target PDF document is a scanning PDF document type if the picture area ratio is greater than a first preset threshold and the character area ratio is less than a second preset threshold; wherein the first preset threshold is greater than a second preset threshold.
Optionally, the types of the target PDF document include a scanned PDF document and a converted PDF document;
the apparatus may further include:
a display module, used for displaying preset prompt information if the type of the target PDF document is a scanned PDF document, wherein the preset prompt information is used for prompting a user to perform a preset operation on a picture object in the scanned PDF document.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
the processor is used for realizing the following method steps when executing the program stored in the memory:
acquiring a preset number of pages contained in a target PDF document;
analyzing the preset number of pages, and determining picture objects and character objects contained in the preset number of pages;
determining picture areas corresponding to all picture objects contained in the preset number of pages according to picture objects contained in the preset number of pages and a preset picture area calculation algorithm, and determining character areas corresponding to all character objects contained in the preset number of pages according to character objects contained in the preset number of pages and a preset character area calculation algorithm;
determining the ratio of the picture area to the total page area corresponding to the preset number of pages according to the total page area corresponding to the preset number of pages and the picture area corresponding to all picture objects contained in the preset number of pages, and taking the ratio as the picture area ratio;
determining the ratio of the text area to the total page area corresponding to the preset number of pages according to the total page area corresponding to the preset number of pages and the text area corresponding to all text objects contained in the preset number of pages, and taking the ratio as the text area ratio;
and if the picture area ratio and the character area ratio meet preset conditions, determining the type of the target PDF document.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the steps of any one of the above PDF document identification methods are implemented.
In a fifth aspect, an embodiment of the present application further provides a computer program product containing instructions, which when run on a computer, causes the computer to execute any one of the above-mentioned PDF document identification methods.
The embodiments of the application provide a PDF document identification method and device and electronic equipment. The method acquires a preset number of pages contained in a target PDF document; analyzes the preset number of pages and determines the picture objects and character objects they contain; determines, according to a preset picture area calculation algorithm and a preset character area calculation algorithm respectively, the picture area corresponding to all picture objects and the character area corresponding to all character objects contained in the preset number of pages; determines the picture area ratio and the character area ratio from the total page area corresponding to the preset number of pages, the picture area, and the character area; and, if the picture area ratio and the character area ratio meet preset conditions, determines the type of the target PDF document. The embodiments thus analyze the picture objects and the character objects in the preset number of pages together, and determine the type of the target PDF document based both on whether it contains character objects and on the proportion of those pages occupied by picture objects and by character objects, which improves the accuracy of PDF document identification.
Of course, it is not necessary for any product or method of the present application to achieve all of the above-described advantages at the same time.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a PDF document identification method according to an embodiment of the present application;
Fig. 2 is a schematic structural diagram of a PDF document identification device according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In order to improve the accuracy of identifying a PDF document, embodiments of the present application provide a PDF document identifying method, an apparatus, and an electronic device, which are described in detail below.
First, a method for identifying a PDF document provided in the embodiment of the present application is described below.
The application provides a PDF document identification method which can be applied to electronic equipment. The electronic device may be a device having both functions of identifying a PDF document and displaying preset prompt information. For example, the device may be a mobile phone or a tablet PC, or may be a Personal Computer (PC), a television, or the like.
Referring to fig. 1, fig. 1 is a schematic flowchart of a PDF document identification method provided in an embodiment of the present application, and may include the following steps:
S101, acquiring a preset number of pages contained in the target PDF document.
In general, a PDF document may contain one or more pages. Each page in the PDF document may contain only picture objects, only text objects, or both picture objects and text objects; therefore, PDF documents can be classified according to the proportion of all pages of the document that is occupied by picture objects and by text objects.
In the embodiment of the present application, PDF documents can be classified into two types: scanned PDF documents and converted PDF documents. A scanned PDF document is a PDF document in which the picture objects occupy a proportion of all pages of the document that is greater than or equal to a preset threshold, that is, the document consists largely of picture objects; conversely, if the proportion of all pages of the document occupied by picture objects is smaller than the preset threshold, the document is called a converted PDF document.
In practice, the number of pages to be acquired (i.e. the preset number) may be set in the electronic device. When the electronic device detects a preset operation instruction, for example an instruction from the user to open the target PDF document or an editing instruction for the target PDF document, it can acquire a preset number of pages contained in the target PDF document. The preset number may be 1 or greater, and may be any integer value not greater than the total number of pages of the target PDF document.
For example, if the preset number is 1, the electronic device only needs to acquire any page in the target PDF document, and then perform subsequent steps on the page to determine the type of the target PDF document.
Alternatively, the preset number may be set to the number of pages in a subset of the target PDF document. For example, when the target PDF document contains 20 pages in total, the preset number may be set to 10; that is, 10 consecutive pages may be obtained from the target PDF document, or 10 pages may be obtained from it at random. Similarly, the preset number may also be set to the total number of pages of the target PDF document. In the embodiment of the present application, neither the manner of obtaining the preset number of pages from the target PDF document nor the value of the preset number is specifically limited.
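The two ways of obtaining pages mentioned above (consecutive or random) can be sketched as follows. This is a minimal illustration only; the function name `select_pages`, its parameters, and the use of 0-based page indices are assumptions made for the example and do not come from the application.

```python
import random


def select_pages(total_pages, preset_number, mode="consecutive", start=0, seed=None):
    """Return the 0-based indices of the pages to analyse (hypothetical helper).

    mode="consecutive" takes preset_number pages in a row beginning at `start`;
    mode="random" draws preset_number distinct pages at random.
    """
    if not 1 <= preset_number <= total_pages:
        raise ValueError("preset_number must be between 1 and total_pages")
    if mode == "consecutive":
        if start < 0 or start + preset_number > total_pages:
            raise ValueError("consecutive range does not fit in the document")
        return list(range(start, start + preset_number))
    if mode == "random":
        rng = random.Random(seed)  # seed is only here to make runs repeatable
        return sorted(rng.sample(range(total_pages), preset_number))
    raise ValueError("mode must be 'consecutive' or 'random'")


# A 20-page document with the preset number set to 10, as in the example above.
first_ten = select_pages(20, 10)
random_ten = select_pages(20, 10, mode="random", seed=1)
```

Setting `preset_number` equal to `total_pages` selects every page, which corresponds to the case where the preset number is the total number of pages.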
S102, analyzing the preset number of pages, and determining picture objects and character objects contained in the preset number of pages.
Generally, each page of a PDF document carries a page identifier that distinguishes it from the other pages, for example 1, 2, 3 or a, b, c; and when the PDF document is generated, a configuration file uniquely corresponding to the PDF document may be generated as well. The configuration file may include the page identifier of each page of the target PDF document, the picture objects contained in each page and the length and width of each picture object, the text objects contained in each page and the length and width of each text object, and the page length and page width of each page.
In practice, after acquiring the preset number of pages contained in the target PDF document in step S101, the electronic device may acquire the configuration file of the target PDF document. Using the page identifier of each of the preset number of pages, the electronic device then looks up in the configuration file the picture objects and text objects contained in each of those pages, and thereby determines all the picture objects and text objects contained in the preset number of pages.
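As an illustration of this lookup, the sketch below models the configuration file as an in-memory mapping from page identifier to a page entry. The application does not specify the file's actual layout, so the class names `PageObject` and `PageEntry`, the field names, and the function `objects_for_pages` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PageObject:
    kind: str      # "picture" or "text" (assumed labels)
    length: float
    width: float


@dataclass
class PageEntry:
    page_length: float
    page_width: float
    objects: list = field(default_factory=list)


def objects_for_pages(config, page_ids):
    """Collect the picture and text objects of the selected pages.

    `config` maps a page identifier to its PageEntry (assumed structure).
    """
    pictures, texts = [], []
    for page_id in page_ids:
        for obj in config[page_id].objects:
            if obj.kind == "picture":
                pictures.append(obj)
            elif obj.kind == "text":
                texts.append(obj)
    return pictures, texts


# Hypothetical two-page configuration with made-up dimensions.
config = {
    "1": PageEntry(20, 10, [PageObject("picture", 18, 9)]),
    "2": PageEntry(20, 10, [PageObject("picture", 10, 9), PageObject("text", 4, 8)]),
}
pictures, texts = objects_for_pages(config, ["1", "2"])
```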
Optionally, the page identifier in the embodiment of the present application may be any identifier for distinguishing different pages, and the present application is not particularly limited.
S103, determining picture areas corresponding to all picture objects contained in the preset number of pages according to picture objects contained in the preset number of pages and a preset picture area calculation algorithm, and determining character areas corresponding to all character objects contained in the preset number of pages according to character objects contained in the preset number of pages and a preset character area calculation algorithm.
In implementation, after the electronic device has analyzed the preset number of pages in step S102 and determined the picture objects and the text objects they contain, that is, after it has read this information from the configuration file corresponding to the target PDF document, it may apply the preset picture area calculation algorithm to the length and width of each picture object read from the configuration file, so as to determine the picture area corresponding to all the picture objects contained in the preset number of pages.
Optionally, the following is an algorithm for calculating a picture area corresponding to a picture object provided in the embodiment of the present application, and specifically may include the following steps:
Step one: determining the length and width corresponding to each picture object contained in the preset number of pages according to the acquired configuration file of the target PDF document.
In implementation, the electronic device may find, in the configuration file of the target PDF document, the picture object included in each page and the length and width corresponding to the picture object according to the corresponding relationship between the first page identifier and the second page identifier of each page in the preset number of pages. And then, executing the second step to determine the area corresponding to each picture object contained in the preset number of pages.
Step two: multiplying the length and the width corresponding to each picture object contained in the preset number of pages to obtain the area corresponding to each picture object.
For example, suppose one of the preset number of pages contains two picture objects, referred to as picture object 1 and picture object 2 to distinguish them. If the length of picture object 1 is 2 and its width is 5, and the length of picture object 2 is 3 and its width is 4, then the area of picture object 1 is 2 × 5 = 10 and the area of picture object 2 is 3 × 4 = 12.
Step three: summing the areas corresponding to all the picture objects in the preset number of pages to obtain the picture areas corresponding to all the picture objects contained in the preset number of pages.
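The three steps can be sketched in a few lines, reusing the dimensions of picture object 1 and picture object 2 from the example above. The function name `total_area` and the representation of each object as a `(length, width)` pair are assumptions made for illustration.

```python
def total_area(dimensions):
    """Sum length * width over an iterable of (length, width) pairs."""
    return sum(length * width for length, width in dimensions)


# Step one: lengths and widths as read from the configuration file.
picture_objects = [(2, 5), (3, 4)]

# Step two: area of each picture object.
areas = [length * width for length, width in picture_objects]

# Step three: picture area corresponding to all picture objects.
picture_area = total_area(picture_objects)
print(areas, picture_area)  # [10, 12] 22
```

The same helper applies unchanged to text objects, since the text area is also computed as a sum of length times width.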
Optionally, an algorithm for calculating a text area corresponding to a text object is further provided in this embodiment of the present application, which specifically includes the following steps:
Step one: determining the length and width corresponding to each character object contained in the preset number of pages according to the acquired configuration file of the target PDF document.
In implementation, this step may be performed in the same way as step one above, which determines the length and width corresponding to each picture object contained in the preset number of pages, and is not described in detail here.
Step two: multiplying the length and the width corresponding to each character object contained in the preset number of pages to obtain the area corresponding to each character object.
In implementation, this step may be performed in the same way as step two above, which determines the area corresponding to each picture object contained in the preset number of pages, and is not described in detail here.
Step three: and summing the areas corresponding to all the character objects contained in the preset number of pages to obtain the character areas corresponding to all the character objects contained in the preset number of pages.
S104, determining the ratio of the picture area to the total page area corresponding to the preset number of pages according to the total page area corresponding to the preset number of pages and the picture area corresponding to all picture objects contained in the preset number of pages, and taking the ratio as the picture area ratio.
In implementation, after determining in step S103 the picture area corresponding to all the picture objects contained in the preset number of pages, the electronic device may determine the page area corresponding to each page in the preset number of pages. The page areas of these pages are added together, and the resulting sum is the total page area corresponding to the preset number of pages. Then the picture area determined in step S103 is divided by the total page area determined in this step, and the resulting value (the ratio) is used as the picture area ratio.
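A minimal sketch of this ratio calculation follows. It assumes each page area is the product of the page length and page width, and the page sizes and picture area in the usage lines are made-up values; the function name `area_ratio` is hypothetical.

```python
def area_ratio(object_area, page_sizes):
    """Divide an object area by the total area of the selected pages.

    `page_sizes` is an iterable of (page_length, page_width) pairs.
    """
    total_page_area = sum(length * width for length, width in page_sizes)
    if total_page_area <= 0:
        raise ValueError("total page area must be positive")
    return object_area / total_page_area


# Two hypothetical pages of 20 x 10 each give a total page area of 400.
picture_area_ratio = area_ratio(300, [(20, 10), (20, 10)])
print(picture_area_ratio)  # 0.75
```

Passing the text area instead of the picture area yields the text area ratio in the same way.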
S105, determining the ratio of the text area to the total page area corresponding to the preset number of pages according to the total page area corresponding to the preset number of pages and the text area corresponding to all text objects contained in the preset number of pages, and taking the ratio as the text area ratio.
In implementation, the text area ratio may be determined by using the method of determining the picture area ratio described in step S104, and will not be described in detail here.
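The ratio calculations of steps S104 and S105 can be sketched as below. The helper names `area_sum` and `area_ratio`, the `(length, width)` tuple representation, and the example sizes are all assumptions made for illustration; the application does not prescribe a data layout.

```python
from typing import Sequence, Tuple

Size = Tuple[float, float]  # (length, width) of a page or of an object


def area_sum(sizes: Sequence[Size]) -> float:
    """Sum of length * width over a collection of sizes."""
    return sum(length * width for length, width in sizes)


def area_ratio(object_sizes: Sequence[Size], page_sizes: Sequence[Size]) -> float:
    """Total object area divided by the total page area of the sampled pages."""
    total_page_area = area_sum(page_sizes)
    if total_page_area <= 0:
        raise ValueError("total page area must be positive")
    return area_sum(object_sizes) / total_page_area


# Two sampled pages of 600 x 800 each (total page area 960000).
pages = [(600.0, 800.0), (600.0, 800.0)]
pictures = [(600.0, 720.0), (600.0, 720.0)]  # total picture area 864000
texts = [(400.0, 240.0)]                     # total text area 96000

print(area_ratio(pictures, pages))  # 0.9
print(area_ratio(texts, pages))     # 0.1
```

The same function serves both steps, which mirrors the statement that the text area ratio is determined in the same way as the picture area ratio.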
S106, if the picture area ratio and the text area ratio meet the preset conditions, determining the type of the target PDF document.
In practical applications, the picture objects and text objects contained in a page of the target PDF document may be displayed overlapping one another. For example, a page of the target PDF document may contain a picture object whose upper half has character content and whose lower half is blank, in which case a text object may be overlaid on the lower half of the picture object. Of course, the picture objects and text objects contained in a page may also be distributed independently at different positions in the page. Therefore, in order to identify the type of the target PDF document accurately, preset thresholds may be set for the picture area ratio and the text area ratio respectively, based on practical experience. The target PDF document is determined to be a scanned PDF document only when both the picture area ratio and the text area ratio satisfy their respective preset thresholds, so that the decision takes into account both the proportion of picture objects and the proportion of text objects in the target PDF document. Several implementations of determining the type of the target PDF document according to preset conditions are provided in the embodiments of the present application as follows.
In the first mode, if the picture area ratio is greater than the first preset threshold and the character area ratio is less than the second preset threshold, the target PDF document is determined to be the scanned PDF document type.
The first preset threshold is larger than the second preset threshold.
In implementation, the electronic device may set a preset threshold for each of the text area ratio and the picture area ratio. When the picture area ratio is greater than or equal to the first preset threshold and the text area ratio is less than or equal to the second preset threshold, the target PDF document is determined to be of the scanned PDF document type; otherwise, it is determined to be of the converted PDF document type.
For example, assume that the first preset threshold is 0.9 and the second preset threshold is 0.2, and that the text area ratio of the target PDF document determined by the above steps is 0.2 and the picture area ratio is 0.98. The picture area ratio 0.98 is greater than the first preset threshold 0.9, and the text area ratio 0.2 is equal to, and therefore does not exceed, the second preset threshold 0.2. Both preset conditions are thus met, and the target PDF document is determined to be of the scanned PDF document type. This ensures that the proportion of picture objects in the target PDF document exceeds the proportion of text objects, so that the format type of the target PDF document can be determined accurately.
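The first mode can be sketched as a two-threshold test. The function name `classify_mode_one` and the string labels are hypothetical; the default thresholds are the example values 0.9 and 0.2 from the text, and the boundaries are treated as inclusive, following the "including equal to" wording of the implementation description.

```python
def classify_mode_one(
    picture_ratio: float,
    text_ratio: float,
    first_threshold: float = 0.9,   # example value from the text
    second_threshold: float = 0.2,  # example value from the text
) -> str:
    """Return "scanned" or "converted" using the two-threshold rule."""
    if first_threshold <= second_threshold:
        raise ValueError("first threshold must be larger than second threshold")
    # Inclusive comparisons: a ratio equal to its threshold still qualifies.
    if picture_ratio >= first_threshold and text_ratio <= second_threshold:
        return "scanned"
    return "converted"


print(classify_mode_one(0.98, 0.2))  # scanned
print(classify_mode_one(0.50, 0.6))  # converted
```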
In the second mode, the picture area ratio is divided by the text area ratio, and if the resulting value is greater than a preset threshold, the target PDF document is determined to be of the scanned PDF document type.
In implementation, the electronic device divides the picture area ratio by the text area ratio. If the resulting value is greater than the preset threshold, the target PDF document is determined to be of the scanned PDF document type; otherwise, it is determined to be of the converted PDF document type.
The value of the preset threshold is a rational number greater than 1.
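The second mode can be sketched as a quotient test. The function name `classify_mode_two` and the threshold value 3.0 are hypothetical. The text does not say what happens when the text area ratio is 0, so the handling of that case below is an assumption made only to keep the sketch runnable.

```python
def classify_mode_two(
    picture_ratio: float,
    text_ratio: float,
    threshold: float = 3.0,  # hypothetical value; the text only requires > 1
) -> str:
    """Return "scanned" or "converted" by dividing the two ratios."""
    if threshold <= 1:
        raise ValueError("threshold must be greater than 1")
    if text_ratio == 0:
        # Assumption (not stated in the source): with no text at all, any
        # picture content makes the document scanned; an empty sample is
        # treated as converted.
        return "scanned" if picture_ratio > 0 else "converted"
    quotient = picture_ratio / text_ratio
    return "scanned" if quotient > threshold else "converted"


print(classify_mode_two(0.8, 0.2))  # scanned   (0.8 / 0.2 = 4 > 3)
print(classify_mode_two(0.4, 0.4))  # converted (0.4 / 0.4 = 1 <= 3)
```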
In the third mode, the difference between the picture area ratio and the text area ratio is calculated, and if the difference is greater than a preset threshold, the target PDF document is determined to be of the scanned PDF document type.
In implementation, the electronic device calculates the difference between the picture area ratio and the text area ratio. If the difference is greater than or equal to the preset threshold, the type of the target PDF document is determined to be a scanned PDF document; otherwise, the type of the target PDF document is determined to be a converted PDF document. The preset threshold can be set according to actual needs.
For example, assume that the preset threshold is 0.7. If the text area ratio of the target PDF document determined by the above steps is 0.2 and the picture area ratio is 0.8, the difference between the picture area ratio and the text area ratio is 0.8 - 0.2 = 0.6. Since this difference is less than the preset threshold of 0.7, the type of the target PDF document is a converted PDF document.
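The third mode can be sketched as a difference test. The function name `classify_mode_three` is hypothetical; the default threshold 0.7 is the example value from the text, and the comparison is inclusive, following the "larger than or equal to" wording of the implementation description.

```python
def classify_mode_three(
    picture_ratio: float,
    text_ratio: float,
    threshold: float = 0.7,  # example value from the text
) -> str:
    """Return "scanned" or "converted" by subtracting the two ratios."""
    difference = picture_ratio - text_ratio
    return "scanned" if difference >= threshold else "converted"


# Worked example from the text: 0.8 - 0.2 is about 0.6, which is below 0.7.
print(classify_mode_three(0.8, 0.2))   # converted
print(classify_mode_three(0.95, 0.05)) # scanned
```

Because the ratios are floating-point values, a difference that lies exactly on the threshold may compare either way; a small tolerance could be added if that boundary matters.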
Optionally, after determining the type of the target PDF document through steps S101 to S106, the user may perform different preset operations on different types of target PDF documents. The embodiment of the present application provides an implementation manner of how a user performs a preset operation on different types of target PDF documents, which specifically includes the following steps:
Step one: if the type of the target PDF document is a scanned PDF document, displaying preset prompt information.
The preset prompt information can be used for prompting a user to perform preset operation on a picture object in a scanned PDF document.
In implementation, the electronic device may preset prompt information with different content for different PDF document types. When the electronic device determines that the type of the target PDF document is a scanned PDF document, the displayed preset prompt information may include an identifier of an algorithm that enables operations such as copying and pasting to be performed on picture objects in the scanned PDF document. The algorithm corresponding to the identifier converts a picture object in the scanned PDF document into a text object. Optical character recognition (OCR) is a commonly used algorithm for this purpose. The OCR algorithm identifier may be a calling interface of the OCR algorithm, so that when the electronic device detects an opening instruction entered by the user, the preset OCR algorithm model can be called.
The user can view the preset prompt information through the display interface of the electronic device. After viewing it, the user can open the OCR algorithm model by clicking the OCR algorithm identifier in the preset prompt information. The picture object to be processed in the target PDF document is then input into the OCR algorithm model, which converts the picture object into a text object, after which the text content of the converted picture object can be copied and pasted. For example, a page in a scanned PDF document includes a picture object 1, which is a picture containing a segment of text. After picture object 1 is converted into a text object by the OCR algorithm, the user can perform operations such as copying and pasting on the text content contained in the converted picture object 1.
A converted PDF document may contain only text objects, or it may contain mostly text objects together with a small number of picture objects. Therefore, when it is determined through steps S101 to S106 that the type of the target PDF document is a converted PDF document and the proportion of picture objects in the target PDF document is 0, the preset prompt information may be used to prompt the user to perform a preset operation on the text objects in the converted PDF document. When the type of the target PDF document is determined to be a converted PDF document and the proportion of picture objects in the target PDF document is not 0, the preset prompt information may include the OCR algorithm identifier.
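The choice of prompt described above can be sketched as a small decision function. Everything named here is hypothetical: the `Prompt` record, the function `choose_prompt`, the type labels, and the message wording are illustrative only, and the proportion of picture objects is assumed to be available as a single number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prompt:
    """Hypothetical prompt: a message plus whether to show the OCR identifier."""
    message: str
    offer_ocr: bool


def choose_prompt(doc_type: str, picture_ratio: float) -> Prompt:
    """Pick the preset prompt for a document type and picture proportion."""
    if doc_type == "scanned":
        # Scanned document: prompt an operation on the picture objects.
        return Prompt("Pages are scanned pictures; run OCR to copy text.", True)
    if doc_type == "converted":
        if picture_ratio == 0:
            # Only text objects: prompt an operation on the text objects.
            return Prompt("Text can be selected and copied directly.", False)
        # Mostly text with some pictures: the prompt may include OCR.
        return Prompt("Some content is stored as pictures; OCR is available.", True)
    raise ValueError(f"unknown document type: {doc_type!r}")


print(choose_prompt("converted", 0.0).offer_ocr)  # False
print(choose_prompt("scanned", 0.98).offer_ocr)   # True
```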
According to the PDF document identification method provided by the embodiment of the application, a preset number of pages contained in the target PDF document can be obtained; the preset number of pages are analyzed to determine the picture objects and text objects they contain; the text areas corresponding to all text objects and the picture areas corresponding to all picture objects contained in the preset number of pages are determined according to a preset picture area calculation algorithm and a preset text area calculation algorithm respectively; the picture area ratio and the text area ratio are determined according to the total page area corresponding to the preset number of pages, the picture areas corresponding to all picture objects contained in the preset number of pages, and the text areas corresponding to all text objects contained in the preset number of pages; and if the picture area ratio and the text area ratio meet the preset conditions, the type of the target PDF document is determined. In the embodiment of the present application, the format type of the target PDF document is determined by analyzing the proportions that the picture objects and the text objects respectively occupy in the preset number of pages of the target PDF document, rather than solely according to whether the target PDF document contains text objects, so that the accuracy of identifying the PDF document is improved.
Based on the same technical concept, corresponding to the method embodiment shown in fig. 1, the present application further provides a PDF document identification device, as shown in fig. 2, where the device includes:
an obtaining module 201, configured to obtain a preset number of pages included in a target PDF document;
the object determination module 202 is configured to analyze a preset number of pages, and determine picture objects and text objects included in the preset number of pages;
the area determining module 203 is configured to determine, according to the picture objects included in the preset number of pages and a preset picture area calculation algorithm, picture areas corresponding to all the picture objects included in the preset number of pages, and determine, according to the text objects included in the preset number of pages and a preset text area calculation algorithm, text areas corresponding to all the text objects included in the preset number of pages;
a picture area ratio determining module 204, configured to determine, according to the total page area corresponding to the preset number of pages and the picture areas corresponding to all picture objects included in the preset number of pages, a ratio between the picture area and the total page area corresponding to the preset number of pages, and use the ratio as a picture area ratio;
a text area ratio determining module 205, configured to determine, according to a total page area corresponding to a preset number of pages and text areas corresponding to all text objects included in the preset number of pages, a ratio between the text area and the total page area corresponding to the preset number of pages, and use the ratio as a text area ratio;
and the document type determining module 206 is configured to determine the type of the target PDF document if the picture area ratio and the character area ratio satisfy a preset condition.
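The division of work among modules 201 to 206 can be sketched as one class whose methods correspond to the modules. This is an illustration under stated assumptions, not the device of the application: the `Page` record stands in for an already-parsed page, the class and method names are hypothetical, the preset number of pages defaults to an arbitrary 3, and the type decision uses the two-threshold rule with the example values 0.9 and 0.2.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Size = Tuple[float, float]  # (length, width)


@dataclass
class Page:
    """Hypothetical parsed page: its own size plus the sizes of its objects."""
    size: Size
    pictures: List[Size] = field(default_factory=list)
    texts: List[Size] = field(default_factory=list)


def _area(sizes: List[Size]) -> float:
    return sum(length * width for length, width in sizes)


class PdfIdentifier:
    def __init__(self, preset_pages: int = 3,
                 first_threshold: float = 0.9, second_threshold: float = 0.2):
        self.preset_pages = preset_pages
        self.first_threshold = first_threshold
        self.second_threshold = second_threshold

    def obtain(self, document: List[Page]) -> List[Page]:
        """Module 201: take the preset number of pages."""
        return document[: self.preset_pages]

    def determine_objects(self, pages: List[Page]):
        """Module 202: collect picture and text objects from the pages."""
        pictures = [s for page in pages for s in page.pictures]
        texts = [s for page in pages for s in page.texts]
        return pictures, texts

    def determine_areas(self, pictures: List[Size], texts: List[Size]):
        """Module 203: total picture area and total text area."""
        return _area(pictures), _area(texts)

    def determine_ratios(self, pages, picture_area, text_area):
        """Modules 204 and 205: each area divided by the total page area."""
        total = _area([page.size for page in pages])
        if total <= 0:
            raise ValueError("total page area must be positive")
        return picture_area / total, text_area / total

    def determine_type(self, picture_ratio: float, text_ratio: float) -> str:
        """Module 206: two-threshold rule (inclusive boundaries assumed)."""
        if (picture_ratio >= self.first_threshold
                and text_ratio <= self.second_threshold):
            return "scanned"
        return "converted"

    def identify(self, document: List[Page]) -> str:
        pages = self.obtain(document)
        pictures, texts = self.determine_objects(pages)
        picture_area, text_area = self.determine_areas(pictures, texts)
        ratios = self.determine_ratios(pages, picture_area, text_area)
        return self.determine_type(*ratios)


scan = [Page((800.0, 600.0), pictures=[(800.0, 600.0)])] * 2
report = [Page((800.0, 600.0), texts=[(700.0, 500.0)])]
print(PdfIdentifier().identify(scan))    # scanned
print(PdfIdentifier().identify(report))  # converted
```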
In an embodiment of the present application, the area determining module may include:
the length and width determining submodule corresponding to the picture objects is used for determining the length and width corresponding to each picture object contained in the preset number of pages according to the acquired configuration file of the target PDF document;
the area determining submodule corresponding to the picture objects is used for multiplying the length and the width corresponding to each picture object contained in the preset number of pages to obtain the area corresponding to each picture object;
and the picture area determining submodule corresponding to all the picture objects is used for summing the areas corresponding to all the picture objects in the preset number of pages to obtain the picture areas corresponding to all the picture objects contained in the preset number of pages.
In this embodiment of the application, the area determining module may further include:
the length and width determining submodule corresponding to the character objects is used for determining the length and width corresponding to each character object contained in the preset number of pages according to the acquired configuration file of the target PDF document;
the area determining submodule corresponding to the character objects is used for multiplying the length and the width corresponding to each character object contained in the preset number of pages to obtain the area corresponding to each character object;
and the character area determining submodule corresponding to all the character objects is used for summing the areas corresponding to all the character objects contained in the preset number of pages to obtain the character areas corresponding to all the character objects contained in the preset number of pages.
In an embodiment of the present application, the document type determining module may include:
the scanned PDF document type determining sub-module is used for determining that the target PDF document is of the scanned PDF document type if the picture area ratio is larger than a first preset threshold value and the text area ratio is smaller than a second preset threshold value; the first preset threshold is larger than the second preset threshold.
In the embodiment of the present application, the type of the target PDF document may include a scanned PDF document and a converted PDF document;
the apparatus may further include:
and the display module is used for displaying preset prompt information if the type of the target PDF document is a scanned PDF document, wherein the preset prompt information is used for prompting a user to carry out a preset operation on a picture object in the scanned PDF document.
The embodiment of the present application further provides an electronic device, as shown in fig. 3, which includes a processor 301, a communication interface 302, a memory 303, and a communication bus 304, where the processor 301, the communication interface 302, and the memory 303 complete mutual communication through the communication bus 304,
a memory 303 for storing a computer program;
the processor 301, when executing the program stored in the memory 303, implements the following steps:
acquiring a preset number of pages contained in a target PDF document;
analyzing a preset number of pages, and determining picture objects and character objects contained in the preset number of pages;
determining picture areas corresponding to all picture objects contained in the preset number of pages according to picture objects contained in the preset number of pages and a preset picture area calculation algorithm, and determining character areas corresponding to all character objects contained in the preset number of pages according to character objects contained in the preset number of pages and a preset character area calculation algorithm;
determining the ratio of the picture area to the total page area corresponding to the preset number of pages according to the total page area corresponding to the preset number of pages and the picture area corresponding to all picture objects contained in the preset number of pages, and taking the ratio as the picture area ratio;
determining the ratio of the text area to the total page area corresponding to the preset number of pages according to the total page area corresponding to the preset number of pages and the text area corresponding to all text objects contained in the preset number of pages, and taking the ratio as the text area ratio;
and if the picture area ratio and the character area ratio meet the preset conditions, determining the type of the target PDF document.
For specific implementation and related explanation of each step of the method, reference may be made to the method embodiment shown in fig. 1, which is not described herein again.
In addition, other implementation manners of the method implemented by the processor 301 executing the program stored in the memory 303 are the same as those mentioned in the foregoing method embodiment, and are not described herein again.
The communication bus mentioned for the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, the bus is represented by only one thick line in the figure, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component.
In another embodiment provided by the present application, a computer-readable storage medium is further provided, which stores instructions that, when executed on a computer, cause the computer to execute any one of the above-mentioned PDF document identification methods.
In yet another embodiment provided by the present application, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the PDF document identification methods of the above embodiments.
The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave).
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus, the electronic device, and the computer-readable storage medium embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and in relation to the description, reference may be made to some portions of the description of the method embodiments.
The above embodiments are merely preferred embodiments of the present application, and are not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application are included in the protection scope of the present application.

Claims (11)

1. A PDF document identification method is characterized by comprising the following steps:
acquiring a preset number of pages contained in a target PDF document;
analyzing the preset number of pages, and determining picture objects and character objects contained in the preset number of pages;
determining picture areas corresponding to all picture objects contained in the preset number of pages according to picture objects contained in the preset number of pages and a preset picture area calculation algorithm, and determining character areas corresponding to all character objects contained in the preset number of pages according to character objects contained in the preset number of pages and a preset character area calculation algorithm;
determining the ratio of the picture area to the total page area corresponding to the preset number of pages according to the total page area corresponding to the preset number of pages and the picture area corresponding to all picture objects contained in the preset number of pages, and taking the ratio as the picture area ratio;
determining the ratio of the text area to the total page area corresponding to the preset number of pages according to the total page area corresponding to the preset number of pages and the text area corresponding to all text objects contained in the preset number of pages, and taking the ratio as the text area ratio;
and if the picture area ratio and the character area ratio meet preset conditions, determining the type of the target PDF document.
2. The method according to claim 1, wherein the step of determining picture areas corresponding to all picture objects included in the preset number of pages according to the picture objects included in the preset number of pages and a preset picture area calculation algorithm comprises:
determining the length and width corresponding to each picture object contained in the preset number of pages according to the acquired configuration file of the target PDF document;
multiplying the length and the width corresponding to each picture object contained in the preset number of pages to obtain the area corresponding to each picture object;
summing the areas corresponding to all the picture objects in the preset number of pages to obtain the picture areas corresponding to all the picture objects contained in the preset number of pages.
3. The method according to claim 1, wherein the step of determining the text areas corresponding to all the text objects included in the preset number of pages according to the text objects included in the preset number of pages and a preset text area calculation algorithm comprises:
determining the length and width corresponding to each character object contained in the preset number of pages according to the acquired configuration file of the target PDF document;
multiplying the length and the width corresponding to each character object contained in the preset number of pages to obtain the area corresponding to each character object;
and summing the areas corresponding to all the character objects contained in the preset number of pages to obtain the character areas corresponding to all the character objects contained in the preset number of pages.
4. The method according to claim 1, wherein the step of determining the type of the target PDF document if the picture area ratio and the text area ratio satisfy a preset condition comprises:
if the picture area ratio is larger than a first preset threshold value and the character area ratio is smaller than a second preset threshold value, determining that the target PDF document is of a scanned PDF document type; wherein the first preset threshold is greater than a second preset threshold.
5. The method according to claim 1, wherein the type of the target PDF document comprises a scanned PDF document and a converted PDF document;
the method further comprises the following step:
if the type of the target PDF document is a scanned PDF document, displaying preset prompt information, wherein the preset prompt information is used for prompting a user to carry out a preset operation on a picture object in the scanned PDF document.
6. A PDF document identification apparatus, comprising:
the acquisition module is used for acquiring a preset number of pages contained in the target PDF document;
the object determining module is used for analyzing the preset number of pages and determining the picture objects and the character objects contained in the preset number of pages;
the area determination module is used for determining the picture areas corresponding to all the picture objects contained in the preset number of pages according to the picture objects contained in the preset number of pages and a preset picture area calculation algorithm, and determining the character areas corresponding to all the character objects contained in the preset number of pages according to the character objects contained in the preset number of pages and a preset character area calculation algorithm;
a picture area ratio determining module, configured to determine, according to a total page area corresponding to the preset number of pages and picture areas corresponding to all picture objects included in the preset number of pages, a ratio between the picture area and the total page area corresponding to the preset number of pages, and use the ratio as a picture area ratio;
a text area ratio determining module, configured to determine, according to a total page area corresponding to the preset number of pages and text areas corresponding to all text objects included in the preset number of pages, a ratio between the text area and the total page area corresponding to the preset number of pages, and use the ratio as a text area ratio;
and the document type determining module is used for determining the type of the target PDF document if the picture area ratio and the character area ratio meet preset conditions.
7. The apparatus of claim 6, wherein the area determination module comprises:
the length and width determining submodule corresponding to the picture objects is used for determining the length and width corresponding to each picture object contained in the preset number of pages according to the acquired configuration file of the target PDF document;
the area determining submodule corresponding to the picture objects is used for multiplying the length and the width corresponding to each picture object contained in the preset number of pages to obtain the area corresponding to each picture object;
and the picture area determining submodule corresponding to all the picture objects is used for summing the areas corresponding to all the picture objects in the preset number of pages to obtain the picture areas corresponding to all the picture objects contained in the preset number of pages.
8. The apparatus of claim 6, wherein the area determination module further comprises:
a length and width determining submodule corresponding to the text objects, configured to determine, according to the obtained configuration file of the target PDF document, a length and a width corresponding to each text object included in the preset number of pages;
the area determining submodule corresponding to the character objects is used for multiplying the length and the width corresponding to each character object contained in the preset number of pages to obtain the area corresponding to each character object;
and the character area determining submodule corresponding to all the character objects is used for summing the areas corresponding to all the character objects contained in the preset number of pages to obtain the character areas corresponding to all the character objects contained in the preset number of pages.
9. The apparatus of claim 6, wherein the document type determination module comprises:
a scanning PDF document type determining submodule, configured to determine that the target PDF document is a scanning PDF document type if the picture area ratio is greater than a first preset threshold and the character area ratio is less than a second preset threshold; wherein the first preset threshold is greater than a second preset threshold.
10. The apparatus according to claim 6, wherein the type of the target PDF document comprises a scanned PDF document and a converted PDF document;
the device further comprises:
a display module, used for displaying preset prompt information if the type of the target PDF document is a scanned PDF document, wherein the preset prompt information is used for prompting a user to carry out a preset operation on a picture object in the scanned PDF document.
11. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any one of claims 1 to 5 when executing a program stored in the memory.
Application number CN201910051078.9A; priority date 2019-01-17; filing date 2019-01-17; title: PDF document identification method and device and electronic equipment; status: Active; granted as CN111444750B (en)

Publications (2)

CN111444750A (en), published 2020-07-24
CN111444750B (en), published 2023-03-21

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112380824A (en) * 2020-10-09 2021-02-19 北京中科凡语科技有限公司 PDF document processing method, device, equipment and storage medium for automatically identifying columns
CN112528599A (en) * 2020-12-15 2021-03-19 信号旗智能科技(上海)有限公司 Multi-page document processing method, apparatus, computer device and medium based on XML
CN113177541A (en) * 2021-05-17 2021-07-27 上海云扩信息科技有限公司 Method for extracting character contents in PDF document and picture by computer program
CN113536771A (en) * 2021-09-17 2021-10-22 深圳前海环融联易信息科技服务有限公司 Element information extraction method, device, equipment and medium based on text recognition
CN113792659A (en) * 2021-09-15 2021-12-14 上海金仕达软件科技有限公司 Document identification method and device and electronic equipment
CN112528599B (en) * 2020-12-15 2024-05-10 信号旗智能科技(上海)有限公司 XML-based multi-page document processing method, device, computer equipment and medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100021069A1 (en) * 2008-07-22 2010-01-28 Xerox Corporation Pdf de-chunking and object classification
US20170270359A1 (en) * 2016-03-18 2017-09-21 Ricoh Company, Ltd. Document type recognition apparatus, image forming apparatus, document type recognition method, and computer program product

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112380824A (en) * 2020-10-09 2021-02-19 北京中科凡语科技有限公司 PDF document processing method, device, equipment and storage medium for automatically identifying columns
CN112528599A (en) * 2020-12-15 2021-03-19 信号旗智能科技(上海)有限公司 Multi-page document processing method, apparatus, computer device and medium based on XML
CN112528599B (en) * 2020-12-15 2024-05-10 信号旗智能科技(上海)有限公司 XML-based multi-page document processing method, device, computer equipment and medium
CN113177541A (en) * 2021-05-17 2021-07-27 上海云扩信息科技有限公司 Method for extracting character contents in PDF document and picture by computer program
CN113177541B (en) * 2021-05-17 2023-12-19 上海云扩信息科技有限公司 Method for extracting text content in PDF document and picture by computer program
CN113792659A (en) * 2021-09-15 2021-12-14 上海金仕达软件科技有限公司 Document identification method and device and electronic equipment
CN113792659B (en) * 2021-09-15 2024-04-05 上海金仕达软件科技股份有限公司 Document identification method and device and electronic equipment
CN113536771A (en) * 2021-09-17 2021-10-22 深圳前海环融联易信息科技服务有限公司 Element information extraction method, device, equipment and medium based on text recognition

Also Published As

Publication number Publication date
CN111444750B (en) 2023-03-21

Similar Documents

Publication Publication Date Title
CN111444750B (en) PDF document identification method and device and electronic equipment
CN106484266B (en) Text processing method and device
CN109410932B (en) Voice operation method and device based on HTML5 webpage
CN111310750B (en) Information processing method, device, computing equipment and medium
CN111414727B (en) Editing method and device for PDF document header footer and electronic equipment
CN107992631B (en) File management method and terminal
CN103678600A (en) Webpage data processing method and equipment
CN110968374A (en) Document information display method and device, electronic equipment and storage medium
CN104899203B (en) Webpage generation method and device and terminal equipment
CN114359533B (en) Page number identification method based on page text and computer equipment
CN104750667A (en) Image content processing method and mobile terminal
CN110970011A (en) Picture processing method, device and equipment and computer readable storage medium
CN111669312A (en) Message interaction method, electronic device and medium
CN110929479A (en) Method and device for converting PDF scanning piece, electronic equipment and storage medium
CN112000257A (en) Method and device for exporting key contents of document
CN111199136A (en) Document content display method, device and equipment
CN114741144A (en) Web-side complex table display method, device and system
CN112784527A (en) Document merging method and device and electronic equipment
CN108595569B (en) File path copying method, file path copying device and mobile terminal
CN110633457A (en) Content replacement method and device, electronic equipment and readable storage medium
CN111949184A (en) Method and device for creating new document
CN111695371B (en) Table identification method and device, electronic equipment and storage medium
CN112100122B (en) Method and device for storing picture
CN113051235A (en) Document loading method and device, terminal and storage medium
CN112950167A (en) Design service matching method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant