CN114330240A - PDF document analysis method and device, computer equipment and storage medium - Google Patents
- Publication number: CN114330240A
- Application number: CN202111402342.2A
- Authority: CN (China)
- Prior art keywords: analysis, target, PDF document, template, information
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The embodiment of the application belongs to the field of data processing, and relates to a PDF document analysis method and device, computer equipment and a storage medium. The PDF document analysis method comprises the steps of obtaining configuration information of PDF document analysis; the configuration information comprises an analysis template and an analysis rule and a storage rule corresponding to the analysis template; when a PDF document to be analyzed is received, identifying key information in the PDF document to be analyzed, matching the key information with an analysis template, and calling a target analysis template matched with the key information; analyzing the PDF document to be analyzed according to the analysis rule of the target analysis template to generate analysis data; and storing the analysis data into a target database according to the storage rule of the target analysis template. According to the embodiment of the application, the user can configure the analysis process of the PDF document according to the own requirement, the analysis data generated by analysis can be stored according to the requirement of the storage rule configured by the user, the subsequent data processing work can be greatly simplified, and the processing efficiency is improved.
Description
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for parsing a PDF document, a computer device, and a storage medium.
Background
Most existing PDF (Portable Document Format) parsing methods use a general-purpose parsing process that parses the content of the PDF and separates raw information, such as text streams and images, from the PDF content stream.
However, PDF is widely used across many fields as an international electronic document format, and a general-purpose parsing process cannot meet the differing, fine-grained requirements of individual industries.
Disclosure of Invention
The embodiment of the application provides a PDF document analysis method, a device, computer equipment and a storage medium, so as to realize personalized analysis, and the technical scheme is as follows:
in a first aspect, an embodiment of the present application provides a PDF document parsing method, including:
acquiring configuration information for PDF document analysis; the configuration information comprises an analysis template and an analysis rule and a storage rule corresponding to the analysis template;
when a PDF document to be analyzed is received, identifying key information in the PDF document to be analyzed, matching the key information with an analysis template, and calling a target analysis template matched with the key information;
analyzing the PDF document to be analyzed according to the analysis rule of the target analysis template to generate analysis data;
and storing the analysis data into a target database according to the storage rule of the target analysis template.
In one embodiment, obtaining configuration information for PDF document parsing includes:
acquiring identity information of a user;
and acquiring the configuration information bound with the identity information according to the identity information.
In one embodiment, identifying key information in a PDF document to be parsed, matching the key information with a parsing template, and retrieving a target parsing template matched with the key information specifically includes:
identifying block elements in the PDF document to be analyzed, wherein the block elements comprise text, tables or graphs;
identifying key information contained in the block elements, matching the key information with an analysis template, and calling a target analysis template matched with the key information; wherein,
when a block element is identified as text, reading the text content, matching the text content with an analysis template, and calling a target analysis template matched with the text content;
when a block element is identified as a table, reading header information in the table, matching the header information with an analysis template, and calling a target analysis template matched with the header information;
when a block element is identified as a graph, reading text information or pattern information in the graph, matching the text information or the pattern information with the analysis template, and calling the target analysis template matched with the text information or the pattern information.
In one embodiment, parsing the PDF document to be parsed according to the parsing rule of the target parsing template, and generating parsing data includes:
reading the content of the target position according to the target position set by the analysis rule, and generating analysis data; and/or
and reading the content of the target position according to the target position and the target processing mode set by the analysis rule, and performing data processing according to the target processing mode to generate analysis data.
In one embodiment, reading the content of the target position according to the target position and the target processing mode set by the analysis rule, and performing data processing according to the target processing mode to generate analysis data, specifically includes:
when the content read at the target position is an option box, acquiring the RGB values of the option box, determining the state of the option box according to the RGB values, and generating analysis data.
In one embodiment, the configuration information further includes a data archiving rule, and after storing the parsed data in the target database according to the storage rule, the method further includes:
and archiving or clearing the analytic data in the target database according to the data archiving rule.
In one embodiment, before identifying, when a PDF document to be parsed is received, the key information in the PDF document to be parsed and invoking the target parsing template, target parsing rule and target storage rule matched with the key information, the method further includes:
acquiring format information of a PDF document to be analyzed;
and reading the content in the PDF document to be analyzed by adopting a reading mode corresponding to the format information according to the format information.
In a second aspect, an embodiment of the present application provides a PDF document parsing apparatus, including:
the acquisition module is used for acquiring the configuration information of PDF document analysis; the configuration information comprises an analysis template and an analysis rule and a storage rule corresponding to the analysis template;
the matching module is used for identifying key information in the PDF document to be analyzed when the PDF document to be analyzed is received, matching the key information with the analysis template and calling a target analysis template matched with the key information;
the analysis module is used for analyzing the PDF document to be analyzed according to the analysis rule of the target analysis template to generate analysis data;
and the storage module is used for storing the analysis data to the target database according to the storage rule of the target analysis template.
In a third aspect, an embodiment of the present application provides a computer device, including a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the PDF document parsing method in any of the above embodiments when executing the computer program.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when being executed by a processor, the computer program implements the steps of the PDF document parsing method in any of the above embodiments.
Compared with the prior art, the embodiment of the application has the following beneficial effects:
by adopting the technical scheme, the user can configure the analysis process of the PDF document according to the requirement of the user, the analysis data generated by analysis can be stored according to the requirement of the storage rule configured by the user, the subsequent data processing work can be greatly simplified, and the processing efficiency is improved.
The foregoing summary is provided for the purpose of description only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present application will be readily apparent by reference to the drawings and following detailed description.
Drawings
In order to more clearly illustrate the solution of the present application, the drawings needed for describing the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a PDF document parsing method according to the application;
FIG. 3 is an exemplary diagram of a parsing template for a PDF document parsing method according to the application;
FIG. 4 is a flow diagram of another embodiment of a PDF document parsing method according to the present application;
FIG. 5 is a schematic diagram of a PDF document parsing apparatus according to one embodiment of the present application;
FIG. 6 is a schematic structural diagram of another embodiment of a PDF document parsing apparatus according to the application;
FIG. 7 is a schematic block diagram of one embodiment of a computer device according to the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 101, 102, 103 may be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that, the PDF document parsing method provided in the embodiments of the present application is generally executed by a server/terminal device, and accordingly, the PDF document parsing apparatus is generally disposed in the server/terminal device.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
FIG. 2 shows a flow diagram of one embodiment of a PDF document parsing method according to an embodiment of the application. As shown in fig. 2, the PDF document parsing method according to the embodiment of the present application includes the following steps:
S210, acquiring configuration information for PDF document analysis; the configuration information comprises an analysis template and an analysis rule and a storage rule corresponding to the analysis template.
In the embodiment of the application, a user can configure PDF document analysis through a terminal device. The parsing template can be a document template that the user works with routinely, and each user can configure parsing templates according to the document templates in use, such as a balance sheet, an income statement, a debit note, or a payment order.
After taking a document template as the parsing template, the user may configure the parsing rules corresponding to that template. Taking the document template shown in fig. 3 as an example, where the document template is a payment order, the parsing rules may be configured as: 1) reading the content corresponding to "number" in the second row, second column; 2) reading the content corresponding to "date" in the third row; 3) reading the content corresponding to "payment unit" in the fourth row, second column; 4) reading the content corresponding to "total amount" in the fifth row, second column; and so on, so that a PDF document can be parsed according to the configured parsing rules.
In the embodiment of the application, configuring a parsing template reduces the parsing of fixed-format content; for example, in the payment order, the label "payment unit" in the fourth row, first column and the label "total amount" in the fifth row, first column need not be parsed.
Furthermore, in the embodiment of the application, through the configuration of the parsing rule, content that does not need to be parsed can be skipped, which improves parsing efficiency. For example, in the payment order, the content corresponding to "summary" in the sixth row, second column can be configured not to be parsed.
In the embodiment of the present application, the parsing rule may also target specific content; for example, when the content read from the cell below "hundred" in the seventh row, first column of the payment order is "x", the parsing data may be output as null.
The parsing rule in the embodiment of the present application may also be various parsing rules for PDF document parsing, which are known by those skilled in the art now and in the future, and will not be described in detail herein.
The storage rule in the embodiment of the present application may specify the target position in the target database at which the parsing data of the corresponding parsing template is stored; for example, after the content corresponding to "number" in the second row, second column of the payment order is parsed, the parsing data is stored in the first column of the database table.
The storage rule in the embodiment of the present application may also specify the target database in which the parsing data is to be stored.
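The configuration information described above could be held as plain data. The following is a minimal sketch, assuming a table-style template addressed by 1-based row and column; the class names, field names, and table name (`ParseRule`, `ParsingTemplate`, `db_column`, `payment_orders`) are illustrative assumptions and do not come from the application.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ParseRule:
    """One configured read: which cell to read and where to store it."""
    label: str                 # human-readable name, e.g. "number"
    row: int                   # 1-based row of the target position
    col: int                   # 1-based column of the target position
    db_column: Optional[str]   # storage rule: destination column, None = not stored


@dataclass
class ParsingTemplate:
    """A parsing template together with its parsing and storage rules."""
    identifier: str            # text used later to match incoming documents
    target_table: str          # storage rule: destination table (assumed name)
    rules: list[ParseRule] = field(default_factory=list)


# Hypothetical configuration for the payment order of FIG. 3.
PAYMENT_ORDER = ParsingTemplate(
    identifier="payment order",
    target_table="payment_orders",
    rules=[
        ParseRule("number", row=2, col=2, db_column="number"),
        ParseRule("payment unit", row=4, col=2, db_column="payment_unit"),
        ParseRule("total amount", row=5, col=2, db_column="total_amount"),
    ],
)
```

Only cells that hold values are listed; fixed labels such as the first-column "payment unit" cell simply have no rule, which is one way to express that they are not parsed.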
S220, when the PDF document to be analyzed is received, identifying key information in the PDF document to be analyzed, and calling a target analysis template, a target analysis rule and a target storage rule which are matched with the key information according to the key information.
In this embodiment of the present application, an electronic device (for example, the server/terminal device shown in fig. 1) on which the PDF document parsing method operates may receive a PDF document to be parsed in a wired connection manner or a wireless connection manner. It should be noted that the wireless connection means may include, but is not limited to, a 3G/4G/5G connection, a Wi-Fi connection, a Bluetooth connection, a WiMAX connection, a Zigbee connection, a UWB (ultra-wideband) connection, and other wireless connection means now known or developed in the future.
In the embodiment of the present application, when configuring the parsing template, the user may configure an identifier corresponding to the parsing template. For example, the identifier may be the header information of the document template, such as "payment order" as shown in fig. 3; the header information and the size of the document template may also be used together, for example, configuring the identifier of the parsing template in fig. 3 as the header "payment order" with the document size set to small form. It will be appreciated that the document size may also be expressed in any other suitable, readily distinguishable form.
In the embodiment of the application, the identified key information is matched with the identifier of the analysis template, so that the target analysis template, the target analysis rule and the target storage rule which are matched with the key information are obtained.
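As a rough illustration of this matching step, the sketch below compares the identified key information with each configured identifier after trimming whitespace and ignoring case. The function name `match_template` and the dictionary layout are assumptions made for the example; matching on header plus document size could be added by using a (header, size) pair as the key.

```python
from typing import Optional


def match_template(key_info: str, templates: dict[str, dict]) -> Optional[dict]:
    """Return the configured template whose identifier equals the key information.

    `templates` maps an identifier (for example a table header) to the
    configuration stored for it. Returns None when nothing matches.
    """
    normalized = key_info.strip().lower()
    for identifier, template in templates.items():
        if identifier.strip().lower() == normalized:
            return template
    return None
```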
S230, analyzing the PDF document to be analyzed according to the analysis rule of the target analysis template to generate analysis data.
In the embodiment of the present application, the parsing rule may be the parsing rule corresponding to the target parsing template. For example, referring to fig. 3, the content corresponding to "number" in the second row, second column is read, and the target positions to be read in sequence can be configured for the payment order template. A target position can be set by row and column coordinates or by a key value.
S240, storing the analysis data into a target database according to the storage rule of the target analysis template.
According to the embodiment of the application, the analysis data are stored in the target database according to the storage rule of the target analysis template, so that the analysis data can be stored according to the data structure desired by a user, and manual format adjustment is not needed. For example, referring to FIG. 3, after the content corresponding to "number" in the second row, second column is read, it is stored in the "number" column of the database.
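Steps S230 and S240 can be sketched end to end with an in-memory SQLite database. This is a simplified illustration under stated assumptions: the table has already been extracted from the PDF as rows of cells, the sample values are invented, and the names `parse_table`, `store`, and `payment_orders` are hypothetical.

```python
import sqlite3


def parse_table(table, rules):
    """Read each configured target position from an extracted table.

    `table` is a list of rows; `rules` maps a database column name to a
    1-based (row, col) target position.
    """
    return {column: table[r - 1][c - 1] for column, (r, c) in rules.items()}


def store(conn, table_name, record):
    """Insert one parsed record according to the storage rule.

    Table and column names are interpolated into the SQL text, so they must
    come from trusted configuration; the values are bound as parameters.
    """
    columns = ", ".join(record)
    placeholders = ", ".join("?" for _ in record)
    conn.execute(
        f"INSERT INTO {table_name} ({columns}) VALUES ({placeholders})",
        tuple(record.values()),
    )


# Invented sample data standing in for a parsed payment order.
cells = [
    ["payment order", ""],
    ["number", "A-0001"],
    ["date", "10/21/2020"],
    ["payment unit", "Example Co."],
    ["total amount", "1500.00"],
]
rules = {"number": (2, 2), "payment_unit": (4, 2), "total_amount": (5, 2)}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payment_orders (number TEXT, payment_unit TEXT, total_amount TEXT)"
)
record = parse_table(cells, rules)
store(conn, "payment_orders", record)
rows = conn.execute(
    "SELECT number, payment_unit, total_amount FROM payment_orders"
).fetchall()
```

Because the storage rule already names the destination columns, the stored rows need no manual reshaping afterwards.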
In one embodiment, step S210 includes:
S211, acquiring the identity information of the user.
The user can first register an account on the terminal device. The account is bound to identity information that distinguishes different users; the identity information can be an account name, an employee number, or the user's organization name and job title.
In one example, the identity information is the user's organization name and job title, so that staff holding the same position in the same organization can parse with the same configuration information, which facilitates information exchange among them.
S212, acquiring configuration information bound with the identity information according to the identity information.
In one example, after the user logs in to the account, the user performs configuration, and the configured configuration information is associated with the account, so that the user can directly call the configuration information during subsequent login.
In one example, permissions on the configuration information can be set: for identities sharing the same organization name and job title, only an account that has been granted configuration permission may modify the configuration information.
In one example, every account has permission to modify the configuration information, so that the PDF document parsing tool can follow personal preferences.
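One possible way to bind configuration information to identity information, with an optional permission check, is sketched below. The class `ConfigStore` and its methods are hypothetical; the identity is modelled as an (organization name, job title) pair, which is only one of the options the text mentions.

```python
class ConfigStore:
    """Keeps configuration information keyed by identity information."""

    def __init__(self):
        self._configs = {}     # identity -> configuration information
        self._editors = set()  # accounts granted configuration permission

    def grant(self, account):
        """Give an account permission to change configuration information."""
        self._editors.add(account)

    def save(self, account, identity, config):
        """Bind `config` to `identity` if `account` holds the permission."""
        if account not in self._editors:
            raise PermissionError(f"{account} may not configure {identity}")
        self._configs[identity] = config

    def load(self, identity):
        """Return the configuration bound to `identity`, or None."""
        return self._configs.get(identity)
```

Granting the permission to every account reproduces the alternative in which each user configures the tool freely.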
In one embodiment, step S220 includes:
S221, identifying block elements in the PDF document to be analyzed, wherein the block elements comprise text, tables or graphs.
S222, identifying key information contained in the block elements, matching the key information with the analysis template, and calling a target analysis template matched with the key information.
In the embodiment of the present application, the block elements in the PDF document to be parsed may be identified using existing image recognition technology, which is not limited in the embodiment of the present application.
In one example, the block elements may be distinguished according to the spacing between pixel rows of the PDF document. For instance, if the first identified block element is followed by two blank rows, the pixels of the next non-blank row belong to another block element.
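The spacing heuristic can be illustrated with a small sketch. It assumes the page has already been reduced to one flag per pixel row indicating whether that row contains any ink, and it treats a gap of at least `min_gap` blank rows as a block boundary; the function name and the default of two rows are assumptions based on the example above.

```python
def split_blocks(row_has_ink, min_gap=2):
    """Return (start, end) row index pairs, end exclusive, one per block.

    A new block starts whenever at least `min_gap` consecutive blank rows
    separate two non-blank rows.
    """
    blocks = []
    start = None     # first row of the block being built
    last_ink = None  # most recent non-blank row in that block
    for i, ink in enumerate(row_has_ink):
        if not ink:
            continue
        if start is None:
            start = i
        elif i - last_ink - 1 >= min_gap:
            blocks.append((start, last_ink + 1))
            start = i
        last_ink = i
    if start is not None:
        blocks.append((start, last_ink + 1))
    return blocks
```

A single blank row inside a paragraph therefore does not split it, while two or more blank rows do.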
When a block element is identified as text, the text content is read and matched with the parsing templates, and the target parsing template matched with the text content is called.
In one example, the target parsing template for text content may be a universal template.
In one example, the target parsing template for text content may be matched according to a keyword group in the text content, with the universal template used when nothing matches.
In the embodiment of the present application, the parsing rule of the universal template may be to read the text line by line.
When a block element is identified as a table, the header information of the table is read and matched with the parsing templates, and the target parsing template matched with the header information is called.
When a block element is identified as a graph, the text information or pattern information in the graph is read and matched with the parsing templates, and the target parsing template, target parsing rule or target storage rule matched with the text information or pattern information is called.
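The three cases can be sketched as a small dispatch on the block type. The dictionary keys (`kind`, `content`, `header`, `text`, `pattern`), the substring-based keyword match, and the fallback name `universal` are all assumptions made for illustration.

```python
UNIVERSAL = "universal"  # assumed name for the universal (line-by-line) template


def extract_key_info(block):
    """Pick the key information according to the block element type."""
    kind = block["kind"]
    if kind == "text":
        return block["content"]
    if kind == "table":
        return block["header"]
    if kind == "graph":
        # Prefer text found in the graph, otherwise a pattern description.
        return block.get("text") or block.get("pattern", "")
    raise ValueError(f"unknown block kind: {kind}")


def choose_template(block, identifiers):
    """Return the first identifier contained in the key information.

    Text blocks fall back to the universal template when nothing matches;
    other block types return None.
    """
    key = extract_key_info(block).lower()
    for identifier in identifiers:
        if identifier.lower() in key:
            return identifier
    return UNIVERSAL if block["kind"] == "text" else None
```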
In one example, the graph is a data graph, and the text information read from the graph may be its coordinate axis information.
In one example, when the block element is identified as a graph, the parsing rule may be to determine whether the graph contains coordinate axes and, if so, to read the coordinate axis information and match the target parsing template using it.
In one example, where the graph is a data graph, the parsing rule may be to read the peak value, the trough value, or the trend information of the graph (e.g., the proportion by which it rises or falls).
In one example, when the graph is identified as containing no coordinate axes, the parsing rule may be to read the RGB value at each pixel position.
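For a data graph, the peak, trough, and trend could be computed as below once the plotted values have been recovered. The sketch assumes that recovery has already happened and that "trend" means the relative change from the first value to the last; both are assumptions, and `summarize_series` is a hypothetical name.

```python
def summarize_series(values):
    """Return the peak, trough, and relative first-to-last change of a series."""
    if not values:
        raise ValueError("empty series")
    first, last = values[0], values[-1]
    # Relative change is undefined when the series starts at zero.
    change = None if first == 0 else (last - first) / abs(first)
    return {"peak": max(values), "trough": min(values), "change": change}
```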
In one embodiment, step S230 includes: reading the content of the target position according to the target position set by the analysis rule, and generating analysis data; and/or
and reading the content of the target position according to the target position and the target processing mode set by the analysis rule, and performing data processing according to the target processing mode to generate analysis data.
In one example, one block element in the PDF document to be parsed is identified as a table, and the header information read from the table is "payment order"; this is matched against the parsing templates in the configuration information to obtain the parsing rule and storage rule corresponding to the "payment order" parsing template.
As an example, referring to fig. 3, the parsing rule may be to read the content corresponding to "number" in the second row, second column; the target positions to be read in sequence can be configured for the payment order template. A target position can be set by row and column coordinates or by a key value.
As another example, referring to fig. 3, the parsing rule may be to read the content of the third row, "10/21/2020", and convert it to "2020/10/21" to generate the parsing data.
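A target processing mode such as this date conversion can be sketched with the standard library. The function name and the default format strings are assumptions chosen to match the example.

```python
from datetime import datetime


def convert_date(raw, source_format="%m/%d/%Y", target_format="%Y/%m/%d"):
    """Re-express a date string read from the document in another format."""
    return datetime.strptime(raw, source_format).strftime(target_format)


print(convert_date("10/21/2020"))  # prints 2020/10/21
```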
In one embodiment, reading the content of the target position according to the target position and the target processing mode set by the analysis rule, and performing data processing according to the target processing mode to generate analysis data, specifically includes:
when the content read at the target position is an option box, acquiring the RGB values of the option box, determining the state of the option box according to the RGB values, and generating analysis data.
Specifically, when the read content is an option box, the RGB values of the option box are read to determine whether it is selected or unselected. For example, if the RGB value of the option box is read as (192,192,192), the option box is gray and its state is selected, and parsing data indicating the selected state is generated. If the RGB value is read as (255,251,240), the option box is white and its state is unselected, and parsing data indicating the unselected state is generated.
It should be noted that, in the embodiment of the present application, the value of the option box is determined to be the value in the selected or unselected state according to the RGB values of the option box, and the range of the RGB values may be set by the user according to the personalized requirements.
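The option box rule with user-configurable RGB ranges might look like the following sketch. The specific range bounds in `DEFAULT_RANGES` are invented so that the two example colours above fall inside them; a real configuration would supply its own ranges.

```python
def in_range(rgb, low, high):
    """True when every channel of `rgb` lies within [low, high]."""
    return all(lo <= value <= hi for value, lo, hi in zip(rgb, low, high))


# Assumed ranges: a gray band for "selected", a near-white band for "unselected".
DEFAULT_RANGES = {
    "selected": ((180, 180, 180), (200, 200, 200)),
    "unselected": ((240, 240, 230), (255, 255, 255)),
}


def option_box_state(rgb, ranges=DEFAULT_RANGES):
    """Map the RGB value of an option box to a configured state."""
    for state, (low, high) in ranges.items():
        if in_range(rgb, low, high):
            return state
    return "unknown"
```

Returning "unknown" for colours outside every range leaves room for a later manual check rather than guessing a state.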
In one example, the read content is a hierarchical number identifier, and a parsing rule may be configured according to its characteristics. For example, the size of the hierarchical number identifier is read: document content whose identifier is larger belongs to the first level, content whose identifier is of medium size belongs to the second level, and content whose identifier is smaller belongs to the third level.
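A size-based level assignment can be sketched by ranking the distinct identifier sizes. This generalizes the three levels in the example to any number of distinct sizes, which is an assumption, as is the function name `assign_levels`.

```python
def assign_levels(sizes):
    """Assign level 1 to the largest identifier size, level 2 to the next, etc."""
    ranked = sorted(set(sizes), reverse=True)
    level_of = {size: index + 1 for index, size in enumerate(ranked)}
    return [level_of[size] for size in sizes]


print(assign_levels([18, 14, 12, 14, 18]))  # prints [1, 2, 3, 2, 1]
```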
In one embodiment, the configuration information further includes a data archiving rule, and as shown in fig. 4, the PDF document parsing method further includes step S250: archiving or clearing the parsing data in the target database according to the data archiving rule.
In one example, for a parsing template configured as the payment order document template, the parsing data may be archived by payment unit; that is, the parsing data of payment unit A is stored in a sub-table of the target database AA.
In one example, for a document template whose parsing template is B, the rule may be configured so that its parsing data is cleared from the target database within one week after the parsing data is generated.
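A clearing rule of this kind can be sketched as a filter over stored records. The sketch assumes each record carries a `created` timestamp and that the retention period is exactly seven days; the record layout and the function name are hypothetical.

```python
from datetime import datetime, timedelta


def purge_expired(records, now, retention=timedelta(days=7)):
    """Split records into (kept, purged) according to their age.

    A record is purged once `retention` has elapsed since its creation.
    """
    kept, purged = [], []
    for record in records:
        if now - record["created"] >= retention:
            purged.append(record)
        else:
            kept.append(record)
    return kept, purged
```

An archiving variant would move the purged records to another table instead of discarding them.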
In one embodiment, before step S220, the method further includes:
S2201, obtaining the format information of the PDF document to be analyzed.
Existing PDF documents can be divided into two formats: AcroForm, published in 1997, and XFA Forms, published in 2002.
S2202, reading the content in the PDF document to be analyzed by adopting a reading mode corresponding to the format information according to the format information.
For a PDF document in AcroForm format, data is parsed by reading the content at the target row and column coordinates.
For a PDF document in XFA Forms format, data is parsed by reading the content under the target key value.
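The format-dependent reading mode can be illustrated with a small dispatch. An in-memory dictionary stands in for the PDF document here; no PDF library is involved, and the keys `format`, `cells`, and `fields` as well as the function name `read_value` are assumptions.

```python
def read_value(document, target):
    """Read one value using the reading mode that matches the document format.

    For "acroform", `target` is a 1-based (row, col) pair; for "xfa" it is
    a key naming the field.
    """
    fmt = document["format"]
    if fmt == "acroform":
        row, col = target
        return document["cells"][row - 1][col - 1]
    if fmt == "xfa":
        return document["fields"][target]
    raise ValueError(f"unsupported format: {fmt}")
```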
Fig. 5 shows a PDF document parsing apparatus according to an embodiment of the present application, and as shown in fig. 5, the PDF document parsing apparatus includes:
an obtaining module 410, configured to obtain configuration information for PDF document analysis; the configuration information comprises an analysis template and an analysis rule and a storage rule corresponding to the analysis template;
the matching module 420 is configured to, when a PDF document to be parsed is received, identify key information in the PDF document to be parsed, match the key information with a parsing template, and retrieve a target parsing template matched with the key information;
the analysis module 430 is configured to analyze the PDF document to be analyzed according to the analysis rule of the target analysis template, so as to generate analysis data;
and the storage module 440 is configured to store the parsing data in the target database according to the storage rule of the target parsing template.
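The cooperation of the four modules can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the apparatus itself: the document is represented by a plain string standing in for text extracted from a PDF, the target database is a dictionary of lists, and every name (`ParsingTemplate`, `obtain_configuration`, `process`, and so on) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Database = Dict[str, List[Dict[str, str]]]


@dataclass
class ParsingTemplate:
    """Hypothetical template: a keyword to match plus its two rules."""
    name: str
    keyword: str                                    # key information to match
    parsing_rule: Callable[[str], Dict[str, str]]   # document text -> parsed data
    storage_table: str                              # storage rule: target table name


def obtain_configuration() -> List[ParsingTemplate]:
    """Obtaining module 410: return the configured templates (hard-coded here)."""
    def parse_payment(text: str) -> Dict[str, str]:
        # Assumed layout for this sketch: "Payment Order;<payer>;<amount>"
        _, payer, amount = text.split(";")
        return {"payer": payer, "amount": amount}

    return [ParsingTemplate("payment_order", "Payment Order", parse_payment, "payments")]


def match_template(text: str, templates: List[ParsingTemplate]) -> Optional[ParsingTemplate]:
    """Matching module 420: pick the first template whose keyword occurs in the text."""
    return next((t for t in templates if t.keyword in text), None)


def parse_document(text: str, template: ParsingTemplate) -> Dict[str, str]:
    """Parsing module 430: apply the parsing rule of the target template."""
    return template.parsing_rule(text)


def store(data: Dict[str, str], template: ParsingTemplate, database: Database) -> None:
    """Storage module 440: append the data to the table named by the storage rule."""
    database.setdefault(template.storage_table, []).append(data)


def process(text: str, database: Database) -> bool:
    """Run the four modules in order; return False if no template matches."""
    template = match_template(text, obtain_configuration())
    if template is None:
        return False
    store(parse_document(text, template), template, database)
    return True
```

Because the parsing rule and the storage rule travel with the matched template, `process` itself contains no document-specific logic, which is the point of making the parsing process configurable.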
In one embodiment, the obtaining module 410 includes:
the identity information acquisition submodule is used for acquiring the identity information of the user;
and the configuration information acquisition submodule is used for acquiring the configuration information bound with the identity information according to the identity information.
In one embodiment, the matching module 420 includes:
the block element identification submodule is used for identifying block elements in the PDF document to be analyzed, wherein the block elements comprise characters, tables or graphs;
the template matching submodule is used for identifying key information contained in the block elements, matching the key information with the analysis template, and calling a target analysis template matched with the key information; wherein the template matching submodule is specifically used for:
under the condition that the block element is identified as characters, reading the character content of the characters, matching the character content with the analysis template, and calling a target analysis template matched with the character content;
under the condition that the block element is identified as a table, reading header information in the table, matching the header information with the analysis template, and calling a target analysis template, a target analysis rule or a target storage rule matched with the header information;
and under the condition that the block element is identified as a graph, reading character information or pattern information in the graph, matching the character information or the pattern information with the analysis template, and calling a target analysis template, a target analysis rule or a target storage rule matched with the character information or the pattern information.
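The type-dependent choice of key information described above can be sketched as follows. The sketch assumes the block elements have already been extracted from the PDF; `BlockElement`, `key_information` and `match_by_blocks` are hypothetical names, keyword containment is only one possible matching criterion, and matching on pattern information in a graph (image comparison) is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class BlockElement:
    """Hypothetical block element already extracted from a PDF page."""
    kind: str                     # "text", "table" or "graph"
    text: str = ""                # character content, or characters found in a graph
    header: List[str] = field(default_factory=list)  # header row of a table


def key_information(block: BlockElement) -> str:
    """Pick the key information according to the type of block element."""
    if block.kind == "text":
        return block.text
    if block.kind == "table":
        return " ".join(block.header)
    if block.kind == "graph":
        return block.text
    raise ValueError(f"unknown block element type: {block.kind!r}")


def match_by_blocks(blocks: List[BlockElement],
                    templates: Dict[str, List[str]]) -> Optional[str]:
    """Return the name of the first template all of whose keywords occur in
    the key information of some block element, or None if nothing matches."""
    for block in blocks:
        info = key_information(block).lower()
        for name, keywords in templates.items():
            if all(keyword.lower() in info for keyword in keywords):
                return name
    return None
```

Here `templates` maps a template name to the keywords that identify it, for example `{"payment_order": ["payer", "amount"]}` for a table whose header contains those column names.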
In an embodiment, the parsing module 430 is specifically configured to:
reading the content of the target position according to the target position set by the analysis rule, and generating analysis data;
and reading the content of the target position according to the target position and the target processing mode set by the analysis rule, and performing data processing according to the target processing mode to generate analysis data.
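The two cases above, reading the content of a target position with or without a target processing mode, can be sketched as one function. This is an illustrative sketch: the page is modelled as a hypothetical grid of strings, and the processing modes `strip`, `to_number` and `upper` are invented examples rather than modes named in the application.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical processing modes that a parsing rule may name.
PROCESSORS: Dict[str, Callable[[str], object]] = {
    "strip": lambda s: s.strip(),
    "to_number": lambda s: float(s.replace(",", "")),
    "upper": lambda s: s.strip().upper(),
}


def apply_rule(page: List[List[str]], row: int, column: int,
               mode: Optional[str] = None) -> object:
    """Read the content at the target position; if the rule also sets a
    target processing mode, process the content before returning it."""
    raw = page[row][column]
    if mode is None:
        return raw
    return PROCESSORS[mode](raw)
```

For instance, a rule with target position (1, 1) and mode `to_number` turns the cell text `"1,250.50"` into a number that can be stored directly in a numeric column.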
In one embodiment, the parsing module 430 is specifically configured to, when the content of the target position is an option frame, obtain the RGB values of the option frame, determine the state of the option frame according to the RGB values, and generate parsing data.
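The application does not state how the RGB values are mapped to a state. One plausible approach, sketched here purely as an assumption, is to count dark pixels inside the option frame and treat the frame as checked when their share reaches a threshold. The function name and both threshold values are hypothetical.

```python
from typing import Iterable, Tuple

RGB = Tuple[int, int, int]


def option_frame_checked(pixels: Iterable[RGB],
                         dark_threshold: int = 128,
                         fill_ratio: float = 0.2) -> bool:
    """Treat the option frame as checked when the share of dark pixels inside
    it reaches `fill_ratio`. Both thresholds are illustrative assumptions."""
    total = dark = 0
    for r, g, b in pixels:
        total += 1
        # Mean of the three channels as a simple brightness measure.
        if (r + g + b) / 3 < dark_threshold:
            dark += 1
    if total == 0:
        raise ValueError("option frame contains no pixels")
    return dark / total >= fill_ratio
```

In practice the thresholds would need tuning to the rendering resolution and to the colour used for the check mark, and the border of the frame should be excluded from the sampled pixels.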
In one embodiment, the configuration information further includes a data archiving rule, and as shown in fig. 6, the PDF document parsing apparatus further includes an archiving module 450 for archiving or clearing the parsed data in the target database according to the data archiving rule.
In one embodiment, the matching module 420 includes:
the format information acquisition submodule is used for acquiring format information of the PDF document to be analyzed;
and the reading mode acquisition submodule is used for reading the content in the PDF document to be analyzed by adopting a reading mode corresponding to the format information according to the format information.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and which, when executed, can include the processes of the method embodiments described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, or a Read-Only Memory (ROM), or a volatile storage medium such as a Random Access Memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 7, fig. 7 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 7 comprises a memory 71, a processor 72, and a network interface 73, which are communicatively connected to each other via a system bus. It is noted that only a computer device 7 having components 71-73 is shown, but it is to be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory 71 includes at least one type of readable storage medium, including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 71 may be an internal storage unit of the computer device 7, such as a hard disk or a memory of the computer device 7. In other embodiments, the memory 71 may also be an external storage device of the computer device 7, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, provided on the computer device 7. Of course, the memory 71 may also comprise both an internal storage unit of the computer device 7 and an external storage device thereof. In this embodiment, the memory 71 is generally used for storing an operating system installed in the computer device 7 and various application software, such as the program code of the PDF document parsing method. Further, the memory 71 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 72 may, in some embodiments, be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 72 is typically used to control the overall operation of the computer device 7. In this embodiment, the processor 72 is configured to execute the program code stored in the memory 71 or to process data, for example, to execute the program code of the PDF document parsing method.
The network interface 73 may comprise a wireless network interface or a wired network interface, and the network interface 73 is generally used for establishing a communication connection between the computer device 7 and other electronic devices.
The present application further provides another embodiment, which is to provide a computer-readable storage medium storing a PDF document parsing program, which is executable by at least one processor to cause the at least one processor to execute the steps of the PDF document parsing method as described above.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
It is to be understood that the above-described embodiments are merely some, rather than all, of the embodiments of the present application, and that the appended drawings illustrate preferred embodiments of the application without limiting its scope. This application may be embodied in many different forms; these embodiments are provided so that the disclosure of the application will be understood thoroughly. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or replace some of their technical features with equivalents. All equivalent structures made using the contents of the specification and the drawings of the present application, whether applied directly or indirectly in other related technical fields, fall within the protection scope of the present application.
Claims (10)
1. A PDF document parsing method is characterized by comprising the following steps:
acquiring configuration information for PDF document analysis; the configuration information comprises an analysis template and an analysis rule and a storage rule corresponding to the analysis template;
when a PDF document to be analyzed is received, identifying key information in the PDF document to be analyzed, matching the key information with the analysis template, and calling a target analysis template matched with the key information;
analyzing the PDF document to be analyzed according to the analysis rule of the target analysis template to generate analysis data;
and storing the analysis data into a target database according to the storage rule of the target analysis template.
2. The method according to claim 1, wherein the obtaining configuration information for PDF document parsing comprises:
acquiring identity information of a user;
and acquiring configuration information bound with the identity information according to the identity information.
3. The method according to claim 1, wherein the identifying key information in the PDF document to be parsed, matching the key information with the parsing template, and calling a target parsing template matched with the key information specifically includes:
identifying block elements in the PDF document to be analyzed, wherein the block elements comprise characters, tables or graphs;
identifying key information contained in the block elements, matching the key information with the analysis template, and calling a target analysis template matched with the key information; under the condition that the block elements are identified as characters, reading character contents in the characters, matching the character contents with the analysis template, and calling a target analysis template matched with the character contents;
reading header information in the table in the case that the block element is identified as the table; matching the header information with the analysis template, and calling a target analysis template matched with the header information;
reading character information or pattern information in the graph under the condition that the block element is identified as the graph; matching the character information or the pattern information with the analysis template, and calling a target analysis template matched with the character information or the pattern information.
4. The method according to claim 1, wherein the parsing the PDF document to be parsed according to the parsing rule of the target parsing template to generate parsing data specifically comprises:
reading the content of the target position according to the target position set by the analysis rule to generate analysis data; and/or,
and reading the content of the target position according to the target position and the target processing mode set by the analysis rule, and performing data processing according to the target processing mode to generate analysis data.
5. The method according to claim 4, wherein the reading the content of the target position according to the target position and the target processing mode set by the analysis rule, and performing data processing according to the target processing mode to generate analysis data specifically includes:
in the case that the content read at the target position is an option frame, acquiring RGB values of the option frame, determining the state of the option frame according to the RGB values, and generating analysis data.
6. The method of claim 1, wherein the configuration information further includes a data archiving rule, and wherein storing the parsed data to a target database according to the storage rule further comprises:
and archiving or clearing the analytic data in the target database according to the data archiving rule.
7. The method according to any one of claims 1 to 6, wherein before the identifying key information in the PDF document to be parsed when the PDF document to be parsed is received and calling a target parsing template, a target parsing rule and a target storage rule matching the key information according to the key information, the method further comprises:
acquiring format information of the PDF document to be analyzed;
and reading the content in the PDF document to be analyzed by adopting a reading mode corresponding to the format information according to the format information.
8. A PDF document parsing apparatus, comprising:
the acquisition module is used for acquiring the configuration information of PDF document analysis; the configuration information comprises an analysis template and an analysis rule and a storage rule corresponding to the analysis template;
the matching module is used for identifying key information in the PDF document to be analyzed when the PDF document to be analyzed is received, matching the key information with the analysis template and calling a target analysis template matched with the key information;
the analysis module is used for analyzing the PDF document to be analyzed according to the analysis rule of the target analysis template to generate analysis data;
and the storage module is used for storing the analysis data to a target database according to the storage rule of the target analysis template.
9. A computer device comprising a memory and a processor, the memory having a computer program stored therein, wherein the processor, when executing the computer program, implements the steps of the PDF document parsing method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps of the PDF document parsing method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111402342.2A CN114330240A (en) | 2021-11-19 | 2021-11-19 | PDF document analysis method and device, computer equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111402342.2A CN114330240A (en) | 2021-11-19 | 2021-11-19 | PDF document analysis method and device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114330240A true CN114330240A (en) | 2022-04-12 |
Family
ID=81047497
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111402342.2A Pending CN114330240A (en) | 2021-11-19 | 2021-11-19 | PDF document analysis method and device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114330240A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116578271A (en) * | 2023-07-12 | 2023-08-11 | 卡斯柯信号(北京)有限公司 | Drawing method and device for application design process model diagram |
CN118349523A (en) * | 2024-04-12 | 2024-07-16 | 广州三七极梦网络技术有限公司 | Automatic analysis method, device, equipment and medium for target file data |
2021-11-19: Application CN202111402342.2A filed in CN (publication CN114330240A); status: Pending
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116578271A (en) * | 2023-07-12 | 2023-08-11 | 卡斯柯信号(北京)有限公司 | Drawing method and device for application design process model diagram |
CN116578271B (en) * | 2023-07-12 | 2023-11-28 | 卡斯柯信号(北京)有限公司 | Drawing method and device for application design process model diagram |
CN118349523A (en) * | 2024-04-12 | 2024-07-16 | 广州三七极梦网络技术有限公司 | Automatic analysis method, device, equipment and medium for target file data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111552704A (en) | Data report generation method and device, computer equipment and storage medium | |
CN114330240A (en) | PDF document analysis method and device, computer equipment and storage medium | |
CN112084752A (en) | Statement marking method, device, equipment and storage medium based on natural language | |
CN111651552A (en) | Structured information determination method and device and electronic equipment | |
CN115758451A (en) | Data labeling method, device, equipment and storage medium based on artificial intelligence | |
CN117195886A (en) | Text data processing method, device, equipment and medium based on artificial intelligence | |
CN112016502A (en) | Safety belt detection method and device, computer equipment and storage medium | |
CN114022891A (en) | Method, device and equipment for extracting key information of scanned text and storage medium | |
EP3564833B1 (en) | Method and device for identifying main picture in web page | |
CN117133006A (en) | Document verification method and device, computer equipment and storage medium | |
WO2018208412A1 (en) | Detection of caption elements in documents | |
CN117217684A (en) | Index data processing method and device, computer equipment and storage medium | |
CN116774973A (en) | Data rendering method, device, computer equipment and storage medium | |
CN111797297A (en) | Page data processing method and device, computer equipment and storage medium | |
CN116450723A (en) | Data extraction method, device, computer equipment and storage medium | |
CN111177387A (en) | User list information processing method, electronic device and computer readable storage medium | |
CN114359928B (en) | Electronic invoice identification method and device, computer equipment and storage medium | |
CN114912003A (en) | Document searching method and device, computer equipment and storage medium | |
CN114510908A (en) | Data export method and device, computer equipment and storage medium | |
CN113989618A (en) | Recyclable article classification and identification method | |
CN113779198A (en) | Electronic business card generating method, device, equipment and medium based on artificial intelligence | |
CN113609833A (en) | Dynamic generation method and device of file, computer equipment and storage medium | |
CN112395450A (en) | Picture character detection method and device, computer equipment and storage medium | |
CN110851346A (en) | Method, device and equipment for detecting boundary problem of query statement and storage medium | |
CN114564917A (en) | EXCEL document analysis method and device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||