CN110609982A - PDF file data analysis system and method - Google Patents
- Publication number
- CN110609982A
- Authority
- CN
- China
- Prior art keywords
- pdf file
- analysis
- file
- configuration
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a PDF file data analysis system comprising an analysis template configuration module, a PDF file acquisition module, a PDF file analysis module and a server. The analysis template configuration module performs analysis configuration according to the content format of the PDF file to be analyzed; the PDF file acquisition module acquires a PDF file according to a network path on the server, stores the PDF file on the server, calculates the MD5 of the acquired PDF file and verifies the PDF file; and the PDF file analysis module analyzes the PDF file according to the configured analysis template. The user defines the analysis template according to the PDF format, and the system automatically acquires files from the file path, records and stores them, which prevents tampering, reduces the workload and errors of manual data entry, and ensures the correctness of experimental data.
Description
Technical Field
The invention relates to the field of data analysis, in particular to a PDF file data analysis system and a PDF file data analysis method.
Background
Information extraction structures the information contained in text into an organized form such as a table. The input of an information extraction system is original text, and the output is information points in a fixed format. Extracting information points from various documents and integrating them in a unified form is the main task of information extraction. The benefit of integrating information in a unified form is ease of review and comparison. Information extraction techniques do not attempt to fully understand the entire document; they analyze only the portions of the document that contain relevant information. Which information is relevant depends on the domain scope defined when the system is designed.
The core function of a laboratory execution system (LES) or an electronic laboratory notebook (ELN) is to rapidly and accurately parse the result data in the spectrum files output by measuring equipment and store it in designated fields of a database, which facilitates subsequent retrieval and calculation.
PDF (Portable Document Format) is an electronic document format that is independent of hardware, operating system and application program. Because of advantages such as cross-platform support, multimedia integration and security, PDF has become one of the most widely used electronic document formats. With the widespread use of PDF documents, a great deal of valuable data is presented in PDF form. How to extract data from a PDF document is therefore a widely studied problem.
Disclosure of Invention
Purpose of the invention: in view of the above defects, the invention provides a PDF file data analysis system and method. A user configures an analysis template according to the format of the PDF, so that analysis templates for the various PDF content formats exported by one device can be located quickly; the system automatically acquires files according to file paths, records and stores them, and prevents tampering. This reduces the workload and errors of manual data entry and ensures the correctness of experimental data.
The technical scheme is as follows:
the PDF file data analysis system comprises an analysis template configuration module, a PDF file acquisition module, a PDF file analysis module and a server;
the analysis template configuration module performs analysis configuration according to the content format of the PDF file to be analyzed; the specific analysis configuration modes are as follows:
(1) according to key-value pairs in the PDF file content; the analysis template configuration module provides a configuration interface in which a box-selection tool is used to select one value as the front key and then another value as the back key, and the content between the two keys is the content to be analyzed;
(2) according to the coordinates of the content within the PDF file, the coordinates comprising the page number, the scaling factor, the abscissa, the ordinate, the width and the height;
(3) according to regular tables in the PDF file, the table types being bordered and borderless;
(4) according to irregular tables in the PDF file;
the PDF file acquisition module acquires a PDF file according to a network path on the server, stores the PDF file into the server, calculates the MD5 of the acquired PDF file and verifies the PDF file;
and the PDF file analysis module analyzes the PDF file according to the configuration analysis template.
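The MD5 calculation and verification performed by the acquisition module can be illustrated with a minimal sketch. It assumes that "verifying" means comparing the digest of the stored copy with the digest of the source file; the function names `md5_of_file` and `verify_copy` are hypothetical and do not come from the patent.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large PDF files are not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copy(source: Path, stored: Path) -> bool:
    """Assumed check: the stored copy must have the same MD5 as the source."""
    return md5_of_file(source) == md5_of_file(stored)
```

Storing the digest alongside the file would let a later audit recompute it and detect any change to the stored copy.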
The PDF file acquisition module monitors a local or network path on the server in real time; when a newly added PDF file is detected in the local or network path, the file is copied to the system's directory of files to be analyzed; feature analysis is then performed on the content of the PDF file to obtain the basic information of the file, which comprises the equipment, the sample and the matched analysis template;
the feature analysis is configured in the analysis template configuration module as one or more feature values, each with a corresponding weight; to calculate the matching degree of a PDF file, the file is matched against each configured feature value, the weights of the matched feature values are added, and the analysis template with the highest score is taken as the best analysis template.
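The weighted feature matching described above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: a feature is assumed to "match" when its value occurs as a substring of the extracted PDF text, the weights of one template are assumed to sum to 100, and the names `Feature`, `match_score` and `best_template` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    value: str     # text expected to appear in the PDF content (assumption)
    weight: float  # share of the score; one template's weights sum to 100


def match_score(pdf_text: str, features: list[Feature]) -> float:
    """Add up the weights of all feature values found in the text."""
    return sum(f.weight for f in features if f.value in pdf_text)


def best_template(pdf_text: str,
                  templates: dict[str, list[Feature]]) -> tuple[str, float]:
    """Return the name and score of the highest-scoring template."""
    scores = {name: match_score(pdf_text, feats)
              for name, feats in templates.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]
```

For example, with one template holding the features "HPLC" (weight 60) and "Retention Time" (weight 40), a file whose text contains both strings scores 100 for that template, while a template none of whose features occur scores 0.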
If the content of the PDF file to be analyzed is at the beginning or the end of the whole PDF file, the analysis template configuration module only needs to configure a front key or a back key.
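The front-key and back-key mode, including the case where only one key is configured, can be sketched on plain text. The sketch assumes the PDF text has already been extracted into a string by some other component; `extract_between` is a hypothetical helper name.

```python
from typing import Optional


def extract_between(text: str, front_key: Optional[str] = None,
                    back_key: Optional[str] = None) -> Optional[str]:
    """Return the text between front_key and back_key.

    A missing front_key means "from the start of the text"; a missing
    back_key means "to the end of the text". Returns None when a
    configured key is not found.
    """
    start = 0
    if front_key is not None:
        pos = text.find(front_key)
        if pos == -1:
            return None
        start = pos + len(front_key)
    end = len(text)
    if back_key is not None:
        # Search for the back key only after the front key.
        end = text.find(back_key, start)
        if end == -1:
            return None
    return text[start:end].strip()
```

With the text `"Sample ID: S-001 Operator: Li Result: 98.7 %"`, the keys `"Sample ID:"` and `"Operator:"` yield `"S-001"`, and configuring only the front key `"Result:"` yields the trailing `"98.7 %"`.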
A user selects and configures a specific analysis template according to actual requirements, and a plurality of commonly used analysis configuration templates are pre-configured in a server.
A PDF file data analysis method comprises the following steps:
Step one: the analysis template configuration module performs analysis configuration according to the content format of the PDF file to be analyzed to obtain a configured analysis template; the specific analysis configuration modes are as follows:
(1) according to key-value pairs in the PDF file content;
(2) according to the coordinates of the content within the PDF file, the coordinates comprising the page number, the scaling factor, the abscissa, the ordinate, the width and the height;
(3) according to regular tables in the PDF file, the table types being bordered and borderless;
(4) according to irregular tables in the PDF file;
Step two: the PDF file acquisition module acquires PDF files on a network path, stores them on the server, calculates the MD5 of each acquired PDF file and verifies the file;
Step three: the PDF file analysis module analyzes the PDF file according to the configured analysis template.
Beneficial effects: the invention is suitable for parsing and storing the data in PDF files such as spectra, reports and report forms that are output by measuring devices after analysis and testing in inspection and analysis institutions such as scientific research laboratories, quality control laboratories of pharmaceutical enterprises and third-party testing laboratories. Data is acquired from the PDF by relative position, by key value and by table. The user defines the analysis template according to the PDF format, and the system automatically acquires files from the file path, records and stores them, which prevents tampering, reduces the workload and errors of manual data entry, and ensures the correctness of experimental data. Feature analysis ensures that each PDF analysis template corresponds one-to-one with the actual file, which improves analysis efficiency and accuracy.
Drawings
FIG. 1 is a diagram of a configuration interface of an analytic template according to the present invention;
FIG. 2 is a flow chart of file monitoring in the present invention;
FIG. 3 is a schematic diagram of a file parsing data flow according to the present invention;
FIG. 4 is a diagram of an instrument configuration in accordance with an embodiment of the present invention;
FIG. 5 is a graph of experimental data acquisition for a specific embodiment of the present invention.
Detailed Description
The invention is further elucidated with reference to the drawings and the embodiments.
The PDF file data analysis system comprises an analysis template configuration module, a PDF file acquisition module, a PDF file analysis module and a server;
the analysis template configuration module performs analysis configuration according to the content format of the PDF file to be analyzed; the specific analysis configuration modes are as follows:
(1) according to key-value pairs in the PDF file content; the analysis template configuration module provides a configuration interface in which a box-selection tool is used to select one value as the front key and then another value as the back key, and the content between the two keys is the content to be analyzed; when the content to be analyzed is at the beginning or the end of the whole PDF file, only the back key or only the front key, respectively, needs to be configured;
(2) according to the coordinates of the content within the PDF file, the coordinates comprising the page number, the scaling factor, the abscissa, the ordinate, the width and the height;
(3) according to regular tables in the PDF file, the table types being bordered and borderless;
(4) according to irregular tables in the PDF file;
the user selects and configures a specific analysis template according to actual requirements; generally, a number of commonly used analysis templates are pre-configured on the server.
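The coordinate-based mode (2) can be pictured as a small data structure holding the six configured values. The patent does not say how the scaling factor is applied, so the sketch below assumes that the abscissa, ordinate, width and height were recorded at the zoom level of the configuration interface and must be divided by the scaling factor to obtain page units. The types `Region` and `Word` and the function `words_in_region` are hypothetical, and the word positions are assumed to come from a separate text extraction step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    page: int      # page number on which the region lies
    scale: float   # zoom factor in effect when the region was drawn (assumed)
    x: float       # abscissa of the top-left corner, in scaled units
    y: float       # ordinate of the top-left corner, in scaled units
    width: float
    height: float

    def to_page_units(self) -> tuple[float, float, float, float]:
        """Return (x0, y0, x1, y1) with the zoom factor divided out."""
        x0, y0 = self.x / self.scale, self.y / self.scale
        return (x0, y0,
                x0 + self.width / self.scale,
                y0 + self.height / self.scale)


@dataclass(frozen=True)
class Word:
    page: int
    x: float
    y: float
    text: str


def words_in_region(words: list[Word], region: Region) -> str:
    """Join the words whose anchor point falls inside the region."""
    x0, y0, x1, y1 = region.to_page_units()
    hits = [w for w in words
            if w.page == region.page and x0 <= w.x <= x1 and y0 <= w.y <= y1]
    hits.sort(key=lambda w: (w.y, w.x))  # reading order: top-down, left-right
    return " ".join(w.text for w in hits)
```

A region drawn at 200 % zoom with x=100, y=200, width=200 and height=40 therefore covers the page rectangle from (50, 100) to (150, 120).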
The PDF file acquisition module acquires a PDF file according to a network path on the server, stores the PDF file on the server, calculates the MD5 of the acquired PDF file and verifies the PDF file. Specifically: the PDF file acquisition module monitors a local or network path on the server in real time; when a newly added PDF file is detected in the local or network path, the file is copied to the system's directory of files to be analyzed. Feature analysis is then performed on the content of the PDF file to obtain the basic information of the file, which comprises the equipment, the sample and the matched analysis template, and a processing log is recorded and stored on the server. The feature analysis is configured in the analysis template configuration module; one or more feature values can be configured, and each feature has a weight (the total score is 100). When a PDF file is analyzed, its matching degree is calculated: the file is matched against each configured feature value, the weights of the matched feature values are added, and the analysis template with the highest score is taken as the best analysis template. The feature analysis finally returns the equipment type and the set of analysis templates.
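The monitoring step can be approximated by repeatedly scanning the watched path. The sketch below uses simple polling, which is a stand-in for the real-time monitoring described above rather than the patent's mechanism; `scan_once`, the `seen` set and the shape of the log entries are all assumptions made for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def scan_once(watched: Path, pending: Path, seen: set[str]) -> list[dict]:
    """Copy PDFs not seen before into the pending directory and log them."""
    pending.mkdir(parents=True, exist_ok=True)
    log = []
    for source in sorted(watched.glob("*.pdf")):
        if source.name in seen:
            continue
        target = pending / source.name
        shutil.copy2(source, target)
        # Record the MD5 of the source and check that the copy matches it.
        source_md5 = hashlib.md5(source.read_bytes()).hexdigest()
        target_md5 = hashlib.md5(target.read_bytes()).hexdigest()
        log.append({"file": source.name, "md5": source_md5,
                    "verified": source_md5 == target_md5})
        seen.add(source.name)
    return log
```

Calling `scan_once` in a loop with a short sleep between calls gives a basic watcher; each returned log entry could then be handed to the feature analysis and stored with the processing log.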
And the PDF file analysis module analyzes the PDF file according to the configuration analysis template.
The PDF file data analysis method adopting the PDF file data analysis system comprises the following steps:
Step one: the analysis template configuration module performs analysis configuration according to the content format of the PDF file to be analyzed to obtain a configured analysis template; the specific analysis configuration modes are as follows:
(1) according to key-value pairs in the PDF file content;
(2) according to the coordinates of the content within the PDF file, the coordinates comprising the page number, the scaling factor, the abscissa, the ordinate, the width and the height;
(3) according to regular tables in the PDF file, the table types being bordered and borderless;
(4) according to irregular tables in the PDF file;
Step two: the PDF file acquisition module acquires PDF files on a network path, stores them on the server, calculates the MD5 of each acquired PDF file and verifies the file;
Step three: the PDF file analysis module analyzes the PDF file according to the configured analysis template.
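Configuration mode (3) in step one covers regular tables without borders. The patent does not describe how such a table is parsed, so the following is only one possible approach: it assumes the table text has already been extracted line by line and that columns are separated by at least two spaces. `parse_borderless_table` is a hypothetical name.

```python
import re


def parse_borderless_table(lines: list[str]) -> list[dict[str, str]]:
    """Parse a whitespace-aligned table; the first non-empty line is the header.

    Columns are split on runs of two or more whitespace characters, so a
    header such as "Retention Time" with a single space stays in one cell.
    Rows whose cell count differs from the header are skipped.
    """
    rows = [re.split(r"\s{2,}", line.strip()) for line in lines if line.strip()]
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body if len(row) == len(header)]
```

Each returned dictionary maps a column header to the cell text of one row, which is a convenient shape for writing the values into designated database fields.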
In the invention, the user defines the analysis template according to the format of the PDF, and feature analysis of the configured PDF is supported, so that analysis templates for the various PDF content formats exported by one device can be located quickly. The system automatically acquires files according to file paths, records and stores them, and prevents tampering. The content of each file is automatically extracted and stored according to the analysis template, which reduces the workload and errors of manual data entry and ensures the correctness of experimental data.
The overall design idea is to automatically parse the content of PDF files exported by laboratory instruments in the pharmaceutical industry, store the original data and prevent tampering. Any laboratory execution system built on this design can solve both the problem of parsing the content of PDF files exported by instruments and the problem of recording and auditing experimental data.
Although the preferred embodiments of the present invention have been described in detail, the present invention is not limited to the details of the foregoing embodiments, and various equivalent changes (such as number, shape, position, etc.) may be made to the technical solution of the present invention within the technical spirit of the present invention, and the equivalents are protected by the present invention.
Claims (5)
- 1. A PDF file data analysis system, characterized in that: the system comprises an analysis template configuration module, a PDF file acquisition module, a PDF file analysis module and a server; the analysis template configuration module performs analysis configuration according to the content format of the PDF file to be analyzed; the specific analysis configuration modes are as follows: (1) according to key-value pairs in the PDF file content; the analysis template configuration module provides a configuration interface in which a box-selection tool is used to select one value as the front key and then another value as the back key, and the content between the two keys is the content to be analyzed; (2) according to the coordinates of the content within the PDF file, the coordinates comprising the page number, the scaling factor, the abscissa, the ordinate, the width and the height; (3) according to regular tables in the PDF file, the table types being bordered and borderless; (4) according to irregular tables in the PDF file; the PDF file acquisition module acquires a PDF file according to a network path on the server, stores the PDF file on the server, calculates the MD5 of the acquired PDF file and verifies the PDF file; and the PDF file analysis module analyzes the PDF file according to the configured analysis template.
- 2. The PDF file data analysis system according to claim 1, wherein: the PDF file acquisition module monitors a local or network path on the server in real time; when a newly added PDF file is detected in the local or network path, the file is copied to the system's directory of files to be analyzed; feature analysis is then performed on the content of the PDF file to obtain the basic information of the file, which comprises the equipment, the sample and the matched analysis template; the feature analysis is configured in the analysis template configuration module as one or more feature values, each with a corresponding weight; to calculate the matching degree of a PDF file, the file is matched against each configured feature value, the weights of the matched feature values are added, and the analysis template with the highest score is taken as the best analysis template.
- 3. The PDF file data analysis system according to claim 1, wherein: if the content of the PDF file to be analyzed is at the beginning or the end of the whole PDF file, the analysis template configuration module only needs to configure a front key or a back key.
- 4. The PDF file data analysis system according to claim 1, wherein: a user selects and configures a specific analysis template according to actual requirements, and a plurality of commonly used analysis templates are pre-configured on the server.
- 5. A PDF file data analysis method applying the PDF file data analysis system of any one of claims 1-4, characterized in that the method comprises the following steps: step one, the analysis template configuration module performs analysis configuration according to the content format of the PDF file to be analyzed to obtain a configured analysis template, the specific analysis configuration modes being as follows: (1) according to key-value pairs in the PDF file content; (2) according to the coordinates of the content within the PDF file, the coordinates comprising the page number, the scaling factor, the abscissa, the ordinate, the width and the height; (3) according to regular tables in the PDF file, the table types being bordered and borderless; (4) according to irregular tables in the PDF file; step two, the PDF file acquisition module acquires PDF files on a network path, stores them on the server, calculates the MD5 of each acquired PDF file and verifies the file; step three, the PDF file analysis module analyzes the PDF file according to the configured analysis template.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910730435.4A CN110609982A (en) | 2019-08-08 | 2019-08-08 | PDF file data analysis system and method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910730435.4A CN110609982A (en) | 2019-08-08 | 2019-08-08 | PDF file data analysis system and method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110609982A true CN110609982A (en) | 2019-12-24 |
Family
ID=68889904
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910730435.4A Pending CN110609982A (en) | 2019-08-08 | 2019-08-08 | PDF file data analysis system and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110609982A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111913910A (en) * | 2020-06-23 | 2020-11-10 | 复旦大学附属中山医院厦门医院 | Follow-up file data extraction method and system |
CN112861821A (en) * | 2021-04-06 | 2021-05-28 | 刘羽 | Map data reduction method based on PDF file analysis |
CN112861822A (en) * | 2021-04-06 | 2021-05-28 | 刘羽 | Map data processing method based on PDF file analysis |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090234818A1 (en) * | 2008-03-12 | 2009-09-17 | Web Access Inc. | Systems and Methods for Extracting Data from a Document in an Electronic Format |
CN105740267A (en) * | 2014-12-10 | 2016-07-06 | 北大方正集团有限公司 | PDF (Portable Document Format) file processing method and apparatus |
CN108415887A (en) * | 2018-02-09 | 2018-08-17 | 武汉大学 | A kind of method that pdf document is converted to OFD files |
CN109726369A (en) * | 2017-10-31 | 2019-05-07 | 中博信息技术研究院有限公司 | A kind of intelligent template questions record Implementation Technology based on normative document |
CN109726388A (en) * | 2018-05-07 | 2019-05-07 | 深圳壹账通智能科技有限公司 | Pdf document analytic method, device, equipment and computer readable storage medium |
Non-Patent Citations (1)
Title |
---|
赵豪迈: "《数字档案长期保存研究》", 30 June 2015, 陕西师范大学出版总社有限公司 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111913910A (en) * | 2020-06-23 | 2020-11-10 | 复旦大学附属中山医院厦门医院 | Follow-up file data extraction method and system |
CN111913910B (en) * | 2020-06-23 | 2022-10-11 | 复旦大学附属中山医院厦门医院 | Follow-up file data extraction method and system |
CN112861821A (en) * | 2021-04-06 | 2021-05-28 | 刘羽 | Map data reduction method based on PDF file analysis |
CN112861822A (en) * | 2021-04-06 | 2021-05-28 | 刘羽 | Map data processing method based on PDF file analysis |
CN112861822B (en) * | 2021-04-06 | 2024-03-12 | 刘羽 | Map data processing method based on PDF file analysis |
CN112861821B (en) * | 2021-04-06 | 2024-04-19 | 刘羽 | Map data reduction method based on PDF file analysis |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR102171220B1 (en) | Character recognition method, device, server and storage medium of claim documents | |
CN106445795B (en) | A kind of database SQL Efficiency testing method and device | |
CN106776515B (en) | Data processing method and device | |
WO2019237540A1 (en) | Method and device for acquiring financial data, terminal device, and medium | |
CN110609982A (en) | PDF file data analysis system and method | |
US11869263B2 (en) | Automated classification and interpretation of life science documents | |
CN106919612B (en) | Processing method and device for online structured query language script | |
CN112506951B (en) | Processing method, server, computing device and system for database slow query log | |
WO2020057021A1 (en) | Data table processing method and device, computer device and storage medium | |
US11163806B2 (en) | Obtaining candidates for a relationship type and its label | |
US9965540B1 (en) | System and method for facilitating associating semantic labels with content | |
CN110737689B (en) | Data standard compliance detection method, device, system and storage medium | |
US20140115438A1 (en) | Generation of test data using text analytics | |
US10121150B2 (en) | Compliance as a service for an organization | |
US11574491B2 (en) | Automated classification and interpretation of life science documents | |
CN112418813A (en) | AEO qualification intelligent rating management system and method based on intelligent analysis and identification and storage medium | |
KR20120003567A (en) | Log management system, log processing method of the same of and recording medium storing the log processing method of the same of | |
CN114115831A (en) | Data processing method, device, equipment and storage medium | |
CN115563985A (en) | Statement analysis method, statement analysis device, statement analysis apparatus, storage medium, and program product | |
US9275358B1 (en) | System, method, and computer program for automatically creating and submitting defect information associated with defects identified during a software development lifecycle to a defect tracking system | |
CN113032515A (en) | Method, system, device and storage medium for generating chart based on multiple data sources | |
CN112463728A (en) | Bibliographic data extraction method of scientific and technological literature | |
CN112416727A (en) | Batch processing operation checking method, device, equipment and medium | |
CN112163072B (en) | Data processing method and device based on multiple data sources | |
CN111858732B (en) | Data fusion method and terminal |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20191224 |