CN110609982A - PDF file data analysis system and method - Google Patents


Info

Publication number
CN110609982A
Authority
CN
China
Prior art keywords
pdf file
analysis
pdf
file
configuration
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910730435.4A
Other languages
Chinese (zh)
Inventor
马立鹏
刘波
孙晓晨
徐焕锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Supcon Technology Co Ltd
Original Assignee
Zhejiang Supcon Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Supcon Technology Co Ltd filed Critical Zhejiang Supcon Technology Co Ltd
Priority to CN201910730435.4A priority Critical patent/CN110609982A/en
Publication of CN110609982A publication Critical patent/CN110609982A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a PDF file data analysis system comprising an analysis template configuration module, a PDF file acquisition module, a PDF file analysis module and a server. The analysis template configuration module performs analysis configuration according to the content format of the PDF file to be analyzed. The PDF file acquisition module acquires the PDF file from a network path on the server, stores it on the server, calculates the MD5 of the acquired file and verifies the file. The PDF file analysis module analyzes the PDF file according to the configured analysis template. Because the user defines the analysis template to suit the PDF format, and the system automatically acquires, records and stores each file according to its file path, tampering is prevented, the workload and errors of manual data entry are reduced, and the correctness of the experimental data is ensured.

Description

PDF file data analysis system and method
Technical Field
The invention relates to the field of data analysis, in particular to a PDF file data analysis system and a PDF file data analysis method.
Background
Information extraction structures the information contained in text into an organized, table-like form. The input of an information extraction system is the original text, and the output is a set of information points in a fixed format. Extracting information points from various documents and integrating them in a unified form is the main task of information extraction. Integrating information in a uniform way makes it easy to review and compare. Information extraction techniques do not attempt to understand the entire document; they analyze only the portions that contain relevant information. Which information is relevant depends on the domain scope fixed when the system is designed.
The core function of an experiment execution system (LES) or an electronic experiment recording system (ELN) is to rapidly and accurately analyze the result data in the spectrum files output by measuring equipment and store it in designated fields of a database, so that it can be retrieved and used in calculations later.
PDF (Portable Document Format) is an electronic document format that is independent of hardware, operating system and application program. Because of its advantages, such as cross-platform support, multimedia integration and security, PDF has become one of the most widely used electronic document formats. With the widespread use of PDF documents, a great deal of valuable data is presented in PDF form. How to extract data from a PDF document is therefore a problem that has been widely noticed and researched.
Disclosure of Invention
The purpose of the invention is as follows: to address the above problem, the invention provides a PDF file data analysis system and method in which the user configures an analysis template according to the format of the PDF, so that quick-positioning analysis templates are available for the various PDF content formats that a single device may export; the system automatically acquires files according to their file paths, records and stores them, and prevents tampering. This reduces the workload and errors of manual data entry and ensures the correctness of the experimental data.
The technical scheme is as follows:
The PDF file data analysis system comprises an analysis template configuration module, a PDF file acquisition module, a PDF file analysis module and a server.
The analysis template configuration module performs analysis configuration according to the content format of the PDF file to be analyzed. The specific analysis configuration modes are:
(1) by key-value pairs in the PDF file content: the analysis template configuration module provides a configuration interface on which, using a frame selection tool, one value is selected as the front key and another value as the back key; the content between the two keys is the content to be analyzed;
(2) by the coordinates of the content within the PDF file, the coordinates comprising the page number, scaling factor, abscissa, ordinate, width and height;
(3) by regular tables in the PDF file, the table types being bordered and borderless;
(4) by irregular tables in the PDF file.
The PDF file acquisition module acquires a PDF file from a network path on the server, stores it on the server, calculates the MD5 of the acquired file and verifies the file.
The PDF file analysis module analyzes the PDF file according to the configured analysis template.
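The front-key/back-key mode in item (1) can be illustrated with a minimal sketch. It assumes the page text has already been extracted to a plain string by some PDF library (not shown); the function name `extract_between` and the sample field labels are hypothetical and do not come from the patent.

```python
from typing import Optional


def extract_between(text: str,
                    front_key: Optional[str],
                    back_key: Optional[str]) -> Optional[str]:
    """Return the text between front_key and back_key.

    Passing None for a key means "start of text" or "end of text".
    Returns None when a configured key cannot be found.
    """
    start = 0
    if front_key is not None:
        pos = text.find(front_key)
        if pos < 0:
            return None
        start = pos + len(front_key)
    end = len(text)
    if back_key is not None:
        pos = text.find(back_key, start)
        if pos < 0:
            return None
        end = pos
    return text[start:end].strip()


# Hypothetical text extracted from an instrument report.
page_text = "Sample ID: S-001 Operator: Li Retention time: 3.42 min"
sample_id = extract_between(page_text, "Sample ID:", "Operator:")
retention = extract_between(page_text, "Retention time:", None)
```

Allowing either key to be `None` covers content that sits at the very beginning or end of the file, where only one key can be configured.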
The PDF file acquisition module monitors a local or network path on the server in real time. When a newly added PDF file is detected on that path, the file is copied to the system's to-be-analyzed file directory; feature analysis is then performed on the content of the PDF file to obtain its basic information, namely the equipment, the sample and the matching analysis template.
The feature analysis is configured in the analysis template configuration module as one or more feature values, each with a corresponding weight. To calculate the matching degree of a PDF file, the file is matched against each configured feature value and the weights of the matched feature values are added up; the template with the highest score is the best analysis template.
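The weighted matching described above can be sketched as follows. This is only an illustration under stated assumptions: a feature is treated as "matched" when its value occurs as a substring of the extracted PDF text, and the template names, feature values and weights are invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Feature:
    value: str   # text expected somewhere in the PDF content
    weight: int  # this feature's share of the template's total score


def match_score(pdf_text: str, features: List[Feature]) -> int:
    """Add up the weights of the features found in the text."""
    return sum(f.weight for f in features if f.value in pdf_text)


def best_template(pdf_text: str,
                  templates: Dict[str, List[Feature]]) -> Tuple[Optional[str], int]:
    """Return (template name, score) for the highest-scoring template."""
    best_name, best = None, 0
    for name, features in templates.items():
        score = match_score(pdf_text, features)
        if score > best:
            best_name, best = name, score
    return best_name, best


# Hypothetical templates; each template's weights total 100.
TEMPLATES = {
    "hplc_report": [Feature("HPLC", 60), Feature("Retention Time", 40)],
    "balance_report": [Feature("Balance", 70), Feature("Net Weight", 30)],
}
pdf_text = "Instrument: HPLC-01\nPeak  Retention Time  Area"
chosen = best_template(pdf_text, TEMPLATES)
```

A file that matches no feature of any template yields `(None, 0)`, which a caller could treat as "no suitable template configured".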
If the content to be analyzed is at the beginning or the end of the whole PDF file, only a front key or a back key needs to be configured in the analysis template configuration module.
The user selects and configures a specific analysis template according to actual requirements; a number of commonly used analysis configuration templates are pre-configured on the server.
A PDF file data analysis method comprises the following steps:
Step 1: the analysis template configuration module performs analysis configuration according to the content format of the PDF file to be analyzed to obtain a configured analysis template; the specific analysis configuration modes are:
(1) by key-value pairs in the PDF file content;
(2) by the coordinates of the content within the PDF file, the coordinates comprising the page number, scaling factor, abscissa, ordinate, width and height;
(3) by regular tables in the PDF file, the table types being bordered and borderless;
(4) by irregular tables in the PDF file;
Step 2: the PDF file acquisition module acquires the PDF file from the network path, stores it on the server, calculates the MD5 of the acquired file and verifies the file;
Step 3: the PDF file analysis module analyzes the PDF file according to the configured analysis template.
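The MD5 calculation and verification in the acquisition step could look roughly like the sketch below, which uses Python's standard `hashlib` module. The function names and the idea of comparing against a digest recorded at acquisition time are assumptions; the patent does not specify how the verification is carried out.

```python
import hashlib


def file_md5(path, chunk_size: int = 1 << 20) -> str:
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unmodified(path, recorded_md5: str) -> bool:
    """Check a stored file against the digest recorded when it was acquired."""
    return file_md5(path) == recorded_md5.lower()
```

Recording the digest when the file is first copied and re-checking it before analysis detects later changes to the stored file. MD5 is adequate for catching accidental modification; a stronger hash such as SHA-256 (`hashlib.sha256`) may be preferable where deliberate tampering is a concern.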
Beneficial effects: the invention is suitable for analyzing and storing the data in PDF-format files, such as spectra and reports, that are output after analysis and testing by the various measuring devices of inspection and analysis institutions such as scientific research laboratories, quality control laboratories of pharmaceutical enterprises and third-party testing laboratories. Data are acquired by relative position, by key value and by table data in the PDF. Because the user defines the analysis template to suit the PDF format, and the system automatically acquires, records and stores each file according to its file path, tampering is prevented, the workload and errors of manual data entry are reduced, and the correctness of the experimental data is ensured. Feature analysis ensures that the PDF analysis template corresponds one-to-one to the actual file, which improves analysis efficiency and accuracy.
Drawings
FIG. 1 is a diagram of the analysis template configuration interface of the present invention;
FIG. 2 is a flow chart of file monitoring in the present invention;
FIG. 3 is a schematic diagram of a file parsing data flow according to the present invention;
FIG. 4 is a diagram of an instrument configuration in accordance with an embodiment of the present invention;
FIG. 5 is a graph of experimental data acquisition for a specific embodiment of the present invention.
Detailed Description
The invention is further elucidated with reference to the drawings and the embodiments.
The PDF file data analysis system comprises an analysis template configuration module, a PDF file acquisition module, a PDF file analysis module and a server.
The analysis template configuration module performs analysis configuration according to the content format of the PDF file to be analyzed. The specific analysis configuration modes are:
(1) by key-value pairs in the PDF file content: the analysis template configuration module provides a configuration interface on which, using a frame selection tool, one value is selected as the front key and another value as the back key; the content between the two keys is the content to be analyzed. Some content to be analyzed may be at the beginning or the end of the whole PDF file, in which case only a front key or a back key needs to be configured;
(2) by the coordinates of the content within the PDF file, the coordinates comprising the page number, scaling factor, abscissa, ordinate, width and height;
(3) by regular tables in the PDF file, the table types being bordered and borderless;
(4) by irregular tables in the PDF file.
The user selects and configures a specific analysis template according to actual requirements; generally, a number of commonly used analysis configuration templates are pre-configured on the server.
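The coordinate mode in item (2) can be pictured as a small record holding the six configured values. The sketch below is an assumption about how such a record might be used: it supposes the rectangle was drawn on a page displayed at a zoom factor (`scale`), divides that factor out to get unscaled page units, and selects words whose positions fall inside. The `Region` class and the `(page, x, y, text)` word tuples are hypothetical; real word positions would come from a PDF library.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Region:
    page: int      # page number
    scale: float   # zoom factor in effect when the box was drawn
    x: float       # abscissa of the box's top-left corner (scaled)
    y: float       # ordinate of the box's top-left corner (scaled)
    width: float   # box width (scaled)
    height: float  # box height (scaled)

    def to_page_units(self) -> Tuple[float, float, float, float]:
        """Return (x0, y0, x1, y1) with the zoom factor divided out."""
        x0, y0 = self.x / self.scale, self.y / self.scale
        return (x0, y0,
                x0 + self.width / self.scale,
                y0 + self.height / self.scale)


Word = Tuple[int, float, float, str]  # (page, x, y, text), hypothetical layout


def words_in_region(words: List[Word], region: Region) -> List[str]:
    """Keep the words on the region's page whose position lies in the box."""
    x0, y0, x1, y1 = region.to_page_units()
    return [text for (page, x, y, text) in words
            if page == region.page and x0 <= x <= x1 and y0 <= y <= y1]


region = Region(page=1, scale=2.0, x=200, y=100, width=100, height=40)
words = [(1, 110.0, 60.0, "3.42"), (1, 300.0, 60.0, "min"), (2, 110.0, 60.0, "other")]
selected = words_in_region(words, region)
```

Storing the scaling factor alongside the rectangle lets a template drawn at one zoom level be applied to the same page at its native size.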
The PDF file acquisition module acquires a PDF file from a network path on the server, stores it on the server, calculates the MD5 of the acquired file and verifies the file. Specifically, the PDF file acquisition module monitors a local or network path on the server in real time; when a newly added PDF file is detected on that path, the file is copied to the system's to-be-analyzed file directory. Feature analysis is then performed on the content of the PDF file to obtain its basic information, namely the equipment, the sample and the matching analysis template, and a processing log is recorded and stored on the server. The feature analysis is configured in the analysis template configuration module: one or more feature values can be configured, each with a weight (the total score is 100). When a PDF file is analyzed, its matching degree is calculated by matching the file against each configured feature value and adding up the weights of the matched feature values; the template with the highest score is the best analysis template. The feature analysis finally returns the equipment type and the set of analysis templates.
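The monitoring step can be approximated by polling a directory and copying any PDF not seen before. This is a sketch under assumptions: the patent only says the path is monitored in real time and does not say how, so polling is a stand-in, and the function name, directory arguments and the `seen` set are all hypothetical.

```python
import shutil
from pathlib import Path
from typing import List, Set


def collect_new_pdfs(watch_dir: Path, pending_dir: Path, seen: Set[str]) -> List[Path]:
    """Copy PDFs that appeared in watch_dir since the last call into pending_dir.

    `seen` holds the file names already handled and is updated in place.
    Returns the paths of the copies made during this call.
    """
    pending_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(watch_dir.glob("*.pdf")):
        if src.name in seen:
            continue
        dst = pending_dir / src.name
        shutil.copy2(src, dst)  # copy2 also preserves file timestamps
        seen.add(src.name)
        copied.append(dst)
    return copied
```

In use, this function would be called in a loop with a short `time.sleep` between calls, and each returned path handed on to the MD5 and feature-analysis steps. Copying rather than moving leaves the instrument's original export untouched.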
The PDF file analysis module analyzes the PDF file according to the configured analysis template.
The PDF file data analysis method using the PDF file data analysis system comprises the following steps:
Step 1: the analysis template configuration module performs analysis configuration according to the content format of the PDF file to be analyzed to obtain a configured analysis template; the specific analysis configuration modes are:
(1) by key-value pairs in the PDF file content;
(2) by the coordinates of the content within the PDF file, the coordinates comprising the page number, scaling factor, abscissa, ordinate, width and height;
(3) by regular tables in the PDF file, the table types being bordered and borderless;
(4) by irregular tables in the PDF file;
Step 2: the PDF file acquisition module acquires the PDF file from the network path, stores it on the server, calculates the MD5 of the acquired file and verifies the file;
Step 3: the PDF file analysis module analyzes the PDF file according to the configured analysis template.
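For the regular-table mode in item (3), a borderless table has no ruling lines, so columns have to be inferred from the text layout. The sketch below shows one simple heuristic, not the patent's method: it assumes the table has already been extracted as text lines, that the first line is the header, and that columns are separated by at least two spaces. The function name and sample rows are hypothetical.

```python
import re
from typing import Dict, List


def parse_borderless_table(lines: List[str]) -> List[Dict[str, str]]:
    """Split text lines into columns on runs of two or more spaces.

    The first non-empty line is taken as the header. Rows whose column
    count differs from the header's are skipped.
    """
    rows = [re.split(r"\s{2,}", line.strip()) for line in lines if line.strip()]
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body if len(row) == len(header)]


table_lines = [
    "Peak  Retention Time  Area",
    "1     3.42            1052.7",
    "2     5.10            203.4",
]
records = parse_borderless_table(table_lines)
```

Splitting on two or more spaces keeps multi-word headers such as "Retention Time" intact. Irregular tables, item (4), would need per-template rules beyond this heuristic.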
In the invention, the user defines the analysis template according to the format of the PDF, and feature analysis of the configured PDF is supported, so that quick-positioning analysis templates are available for the various PDF content formats exported by a single device. The system automatically acquires each file according to its file path, records and stores it, and prevents tampering. File content is automatically extracted and stored according to the analysis template, which reduces the workload and errors of manual data entry and ensures the correctness of the experimental data.
The overall design idea is to automatically analyze the content of PDF files exported by laboratory instruments in the pharmaceutical industry, store the original data and prevent tampering. Any experiment execution system built on this design can solve both the problem of analyzing the content of PDF files exported by instruments and the problem of recording and auditing experimental data.
Although the preferred embodiments of the present invention have been described in detail, the present invention is not limited to the details of the foregoing embodiments. Various equivalent changes (for example in number, shape or position) may be made to the technical solution within the technical spirit of the invention, and all such equivalents fall within the scope of protection of the invention.

Claims (5)

  1. A PDF file data analysis system, characterized in that: the system comprises an analysis template configuration module, a PDF file acquisition module, a PDF file analysis module and a server;
    the analysis template configuration module performs analysis configuration according to the content format of the PDF file to be analyzed; the specific analysis configuration modes are:
    (1) by key-value pairs in the PDF file content: the analysis template configuration module provides a configuration interface on which, using a frame selection tool, one value is selected as the front key and another value as the back key; the content between the two keys is the content to be analyzed;
    (2) by the coordinates of the content within the PDF file, the coordinates comprising the page number, scaling factor, abscissa, ordinate, width and height;
    (3) by regular tables in the PDF file, the table types being bordered and borderless;
    (4) by irregular tables in the PDF file;
    the PDF file acquisition module acquires a PDF file from a network path on the server, stores it on the server, calculates the MD5 of the acquired file and verifies the file;
    the PDF file analysis module analyzes the PDF file according to the configured analysis template.
  2. The PDF file data analysis system according to claim 1, characterized in that: the PDF file acquisition module monitors a local or network path on the server in real time; when a newly added PDF file is detected on that path, the file is copied to the system's to-be-analyzed file directory; feature analysis is then performed on the content of the PDF file to obtain its basic information, namely the equipment, the sample and the matching analysis template;
    the feature analysis is configured in the analysis template configuration module as one or more feature values, each with a corresponding weight; to calculate the matching degree of a PDF file, the file is matched against each configured feature value and the weights of the matched feature values are added up; the template with the highest score is the best analysis template.
  3. The PDF file data analysis system according to claim 1, characterized in that: if the content to be analyzed is at the beginning or the end of the whole PDF file, only a front key or a back key needs to be configured in the analysis template configuration module.
  4. The PDF file data analysis system according to claim 1, characterized in that: the user selects and configures a specific analysis template according to actual requirements, and a number of commonly used analysis configuration templates are pre-configured on the server.
  5. A PDF file data analysis method applying the PDF file data analysis system of any one of claims 1-4, characterized in that the method comprises the following steps:
    Step 1: the analysis template configuration module performs analysis configuration according to the content format of the PDF file to be analyzed to obtain a configured analysis template; the specific analysis configuration modes are:
    (1) by key-value pairs in the PDF file content;
    (2) by the coordinates of the content within the PDF file, the coordinates comprising the page number, scaling factor, abscissa, ordinate, width and height;
    (3) by regular tables in the PDF file, the table types being bordered and borderless;
    (4) by irregular tables in the PDF file;
    Step 2: the PDF file acquisition module acquires the PDF file from the network path, stores it on the server, calculates the MD5 of the acquired file and verifies the file;
    Step 3: the PDF file analysis module analyzes the PDF file according to the configured analysis template.
CN201910730435.4A 2019-08-08 2019-08-08 PDF file data analysis system and method Pending CN110609982A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910730435.4A CN110609982A (en) 2019-08-08 2019-08-08 PDF file data analysis system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910730435.4A CN110609982A (en) 2019-08-08 2019-08-08 PDF file data analysis system and method

Publications (1)

Publication Number Publication Date
CN110609982A true CN110609982A (en) 2019-12-24

Family

ID=68889904

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910730435.4A Pending CN110609982A (en) 2019-08-08 2019-08-08 PDF file data analysis system and method

Country Status (1)

Country Link
CN (1) CN110609982A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111913910A (en) * 2020-06-23 2020-11-10 复旦大学附属中山医院厦门医院 Follow-up file data extraction method and system
CN112861821A (en) * 2021-04-06 2021-05-28 刘羽 Map data reduction method based on PDF file analysis
CN112861822A (en) * 2021-04-06 2021-05-28 刘羽 Map data processing method based on PDF file analysis

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090234818A1 (en) * 2008-03-12 2009-09-17 Web Access Inc. Systems and Methods for Extracting Data from a Document in an Electronic Format
CN105740267A (en) * 2014-12-10 2016-07-06 北大方正集团有限公司 PDF (Portable Document Format) file processing method and apparatus
CN109726369A (en) * 2017-10-31 2019-05-07 中博信息技术研究院有限公司 A kind of intelligent template questions record Implementation Technology based on normative document
CN108415887A (en) * 2018-02-09 2018-08-17 武汉大学 A kind of method that pdf document is converted to OFD files
CN109726388A (en) * 2018-05-07 2019-05-07 深圳壹账通智能科技有限公司 Pdf document analytic method, device, equipment and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
赵豪迈: "《数字档案长期保存研究》", 30 June 2015, 陕西师范大学出版总社有限公司 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111913910A (en) * 2020-06-23 2020-11-10 复旦大学附属中山医院厦门医院 Follow-up file data extraction method and system
CN111913910B (en) * 2020-06-23 2022-10-11 复旦大学附属中山医院厦门医院 Follow-up file data extraction method and system
CN112861821A (en) * 2021-04-06 2021-05-28 刘羽 Map data reduction method based on PDF file analysis
CN112861822A (en) * 2021-04-06 2021-05-28 刘羽 Map data processing method based on PDF file analysis
CN112861822B (en) * 2021-04-06 2024-03-12 刘羽 Map data processing method based on PDF file analysis
CN112861821B (en) * 2021-04-06 2024-04-19 刘羽 Map data reduction method based on PDF file analysis

Similar Documents

Publication Publication Date Title
KR102171220B1 (en) Character recognition method, device, server and storage medium of claim documents
CN106445795B (en) A kind of database SQL Efficiency testing method and device
CN106776515B (en) Data processing method and device
WO2019237540A1 (en) Method and device for acquiring financial data, terminal device, and medium
CN110609982A (en) PDF file data analysis system and method
US11869263B2 (en) Automated classification and interpretation of life science documents
CN106919612B (en) Processing method and device for online structured query language script
CN112506951B (en) Processing method, server, computing device and system for database slow query log
WO2020057021A1 (en) Data table processing method and device, computer device and storage medium
US11163806B2 (en) Obtaining candidates for a relationship type and its label
US9965540B1 (en) System and method for facilitating associating semantic labels with content
CN110737689B (en) Data standard compliance detection method, device, system and storage medium
US20140115438A1 (en) Generation of test data using text analytics
US10121150B2 (en) Compliance as a service for an organization
US11574491B2 (en) Automated classification and interpretation of life science documents
CN112418813A (en) AEO qualification intelligent rating management system and method based on intelligent analysis and identification and storage medium
KR20120003567A (en) Log management system, log processing method of the same of and recording medium storing the log processing method of the same of
CN114115831A (en) Data processing method, device, equipment and storage medium
CN115563985A (en) Statement analysis method, statement analysis device, statement analysis apparatus, storage medium, and program product
US9275358B1 (en) System, method, and computer program for automatically creating and submitting defect information associated with defects identified during a software development lifecycle to a defect tracking system
CN113032515A (en) Method, system, device and storage medium for generating chart based on multiple data sources
CN112463728A (en) Bibliographic data extraction method of scientific and technological literature
CN112416727A (en) Batch processing operation checking method, device, equipment and medium
CN112163072B (en) Data processing method and device based on multiple data sources
CN111858732B (en) Data fusion method and terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20191224)