CN110609982A

CN110609982A - PDF file data analysis system and method

Info

Publication number: CN110609982A
Application number: CN201910730435.4A
Authority: CN
Inventors: 马立鹏; 刘波; 孙晓晨; 徐焕锋
Original assignee: Zhejiang Supcon Technology Co Ltd
Current assignee: Zhejiang Supcon Technology Co Ltd
Priority date: 2019-08-08
Filing date: 2019-08-08
Publication date: 2019-12-24

Abstract

The invention discloses a PDF file data analysis system comprising an analysis template configuration module, a PDF file acquisition module, a PDF file analysis module and a server. The analysis template configuration module performs analysis configuration according to the content format of the PDF file to be analyzed. The PDF file acquisition module acquires a PDF file from a network path on the server, stores it on the server, calculates the MD5 of the acquired file and verifies the file. The PDF file analysis module analyzes the PDF file according to the configured analysis template. Because users define analysis templates to suit the PDF format, and the system automatically acquires, records and stores files by file path, tampering is prevented, the workload and errors of manual data entry are reduced, and the correctness of experimental data is ensured.

Description

PDF file data analysis system and method

Technical Field

The invention relates to the field of data analysis, in particular to a PDF file data analysis system and a PDF file data analysis method.

Background

Information extraction structures the information contained in text into a table-like form. The input of an information extraction system is raw text, and the output is information points in a fixed format. Extracting information points from various documents and integrating them in a unified form is the main task of information extraction; the benefit of a unified form is ease of review and comparison. Information extraction techniques do not attempt to understand the entire document, but analyze only the portions that contain relevant information. Which information is relevant depends on the domain scope fixed when the system is designed.

The core function of a laboratory execution system (LES) or an electronic laboratory notebook (ELN) is to rapidly and accurately analyze the result data in the spectrum files output by measuring equipment and store it in designated fields of a database, so that it can be retrieved and used in later calculations.

PDF (Portable Document Format) is an electronic document format independent of hardware, operating system and application program. Because of advantages such as cross-platform support, multimedia integration and security, PDF has become one of the most widely used electronic document formats. With the widespread use of PDF documents, a great deal of valuable data is presented in PDF form. How to extract data from a PDF document is therefore a widely studied problem.

Disclosure of Invention

The purpose of the invention is as follows: in view of the above problems, the invention provides a PDF file data analysis system and method in which users configure analysis templates according to the PDF format, so that quick-positioning analysis templates can be defined for the various PDF content formats that a single device exports. The system automatically acquires files by file path, records and stores them, and prevents tampering. This reduces the workload and errors of manual data entry and ensures the correctness of experimental data.

The technical scheme is as follows:

the PDF file data analysis system comprises an analysis template configuration module, a PDF file acquisition module, a PDF file analysis module and a server;

the analysis template configuration module performs analysis configuration according to the content format of the PDF file to be analyzed; the specific analysis configuration mode comprises the following steps:

(1) according to key-value pairs in the PDF file content: the analysis template configuration module provides a configuration interface on which a frame selection tool is used to select one value as the front key and then another value as the rear key; the content between the two keys is the content to be analyzed;

(2) according to the coordinates of the content within the PDF file, the coordinates comprising the page number, the scaling factor, the abscissa, the ordinate, the width and the height;

(3) according to the type of regular tables in the PDF file, the table types being divided into bordered and borderless;

(4) according to irregular tables in the PDF file;

the PDF file acquisition module acquires a PDF file according to a network path on the server, stores the PDF file into the server, calculates the MD5 of the acquired PDF file and verifies the PDF file;

and the PDF file analysis module analyzes the PDF file according to the configuration analysis template.
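The MD5 calculation and verification performed by the acquisition module can be sketched as follows. This is a minimal illustration using Python's standard `hashlib`; the function names `file_md5` and `verify_file` are hypothetical and do not come from the patent, and the sketch assumes that "verifying" means comparing a freshly computed digest with the digest recorded when the file was acquired.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        # Read fixed-size chunks so large instrument reports are not
        # loaded into memory all at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path: Path, recorded_md5: str) -> bool:
    """Assumed meaning of 'verify': the stored file still matches the
    digest recorded at acquisition time."""
    return file_md5(path) == recorded_md5.lower()
```

In such a design the digest would be stored alongside the acquired file, so that any later change to the stored PDF makes `verify_file` return `False`.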

The PDF file acquisition module monitors a local or network path on the server in real time; when a newly added PDF file is detected in that path, the file is copied to the system's to-be-analyzed file directory. Feature analysis is then performed on the content of the PDF file to obtain its basic information, which comprises the device, the sample and the matched analysis template;

the feature analysis is configured in the analysis template configuration module as one or more feature values, each with a corresponding weight. To calculate the matching degree of a PDF file, the file is matched against each configured feature value and the weights of the matched feature values are added; the template with the highest resulting score is the best analysis template.
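The weighted matching described above can be illustrated with a short sketch. It assumes that a feature value "matches" when its text occurs in the text extracted from the PDF; the patent does not specify the matching rule, so this substring test is an assumption. The names `Feature`, `match_score` and `best_template` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    text: str    # feature value expected somewhere in the PDF text
    weight: int  # this feature's share of the template's score


def match_score(pdf_text: str, features: list[Feature]) -> int:
    """Add the weights of every configured feature found in the text.

    Assumption: a feature matches if its text is a substring of pdf_text.
    """
    return sum(f.weight for f in features if f.text in pdf_text)


def best_template(pdf_text: str,
                  templates: dict[str, list[Feature]]) -> tuple[str, int]:
    """Return (template name, score) for the highest-scoring template."""
    if not templates:
        raise ValueError("no analysis templates configured")
    scores = {name: match_score(pdf_text, feats)
              for name, feats in templates.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]
```

For example, given a template "hplc" with features ("Agilent", 60) and ("Retention Time", 40) and a template "balance" with features ("Mettler", 70) and ("Net weight", 30), a report containing "Agilent", "Retention Time" and "Net weight" scores 100 for "hplc" and 30 for "balance", so "hplc" is selected.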

If the content to be analyzed is at the beginning or the end of the whole PDF file, the analysis template configuration module needs to configure only one key, either the front key or the rear key.
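Extraction between a front key and a rear key, including the case where only one key is configured, might look like the following sketch. It operates on plain text already extracted from the PDF; the function name `extract_between` and the choice to return `None` when a configured key is missing are assumptions made for illustration.

```python
from typing import Optional


def extract_between(text: str,
                    front_key: Optional[str] = None,
                    rear_key: Optional[str] = None) -> Optional[str]:
    """Return the content between front_key and rear_key.

    If front_key is omitted, extraction starts at the beginning of the
    text; if rear_key is omitted, it runs to the end. Returns None when
    a configured key cannot be found (an assumed convention).
    """
    start = 0
    if front_key is not None:
        pos = text.find(front_key)
        if pos == -1:
            return None
        start = pos + len(front_key)
    end = len(text)
    if rear_key is not None:
        # Search for the rear key only after the front key.
        pos = text.find(rear_key, start)
        if pos == -1:
            return None
        end = pos
    return text[start:end].strip()
```

With the text `"Sample ID: S-001 Operator: Li Result: 99.2 %"`, the keys `"Sample ID:"` and `"Operator:"` yield `"S-001"`, and configuring only the front key `"Result:"` yields `"99.2 %"`.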

A user selects and configures a specific analysis template according to actual requirements, and a plurality of commonly used analysis configuration templates are pre-configured in a server.

A PDF file data analysis method comprises the following steps:

step one: the analysis template configuration module performs analysis configuration according to the content format of the PDF file to be analyzed, obtaining a configured analysis template; the specific analysis configuration modes are as follows:

(1) according to key-value pairs in the PDF file content;

(2) according to the coordinates of the content within the PDF file, the coordinates comprising the page number, the scaling factor, the abscissa, the ordinate, the width and the height;

(3) according to the type of regular tables in the PDF file, the table types being divided into bordered and borderless;

(4) according to irregular tables in the PDF file;

step two: PDF files on a network path are acquired through the PDF file acquisition module and stored on the server, the MD5 of each acquired PDF file is calculated, and the file is verified;

step three: the PDF file analysis module analyzes the PDF file according to the configured analysis template.
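Configuration mode (2), which locates content by page number, scaling factor, abscissa, ordinate, width and height, can be sketched as a region filter over positioned words. The sketch assumes that some PDF library (not named here) has already produced words with bounding boxes in page units, and that the template coordinates were recorded on a page view enlarged by the scaling factor, so dividing by that factor converts them back to page units. Both assumptions, and the names `Word`, `Region` and `words_in_region`, are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    page: int
    x0: float  # left edge, page units
    y0: float  # top edge, page units
    x1: float  # right edge
    y1: float  # bottom edge
    text: str


@dataclass(frozen=True)
class Region:
    page: int
    scale: float  # scaling factor of the view the region was drawn on
    x: float
    y: float
    width: float
    height: float


def words_in_region(words: list[Word], region: Region) -> str:
    """Join the text of all words lying fully inside the region."""
    # Assumed convention: template coordinates divided by the scaling
    # factor give page units.
    left = region.x / region.scale
    top = region.y / region.scale
    right = left + region.width / region.scale
    bottom = top + region.height / region.scale
    hits = [w for w in words
            if w.page == region.page
            and w.x0 >= left and w.x1 <= right
            and w.y0 >= top and w.y1 <= bottom]
    # Simple reading order: top to bottom, then left to right.
    hits.sort(key=lambda w: (w.y0, w.x0))
    return " ".join(w.text for w in hits)
```

A production implementation would likely need tolerance for words that straddle the region border and for small vertical offsets between words on the same line.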

Beneficial effects: the invention is suitable for analyzing and storing the data in PDF-format files such as spectra, reports and tabular reports that are output after measuring devices perform analysis and detection in inspection and analysis institutions such as scientific research laboratories, quality control laboratories of pharmaceutical enterprises and third-party testing laboratories. Data is acquired by relative position, by key value and from table data in the PDF. Because users define analysis templates to suit the PDF format, and the system automatically acquires, records and stores files by file path, tampering is prevented, the workload and errors of manual data entry are reduced, and the correctness of experimental data is ensured. Feature analysis ensures that each PDF file is matched with its corresponding analysis template, which improves analysis efficiency and accuracy.

Drawings

FIG. 1 is a diagram of the analysis template configuration interface of the present invention;

FIG. 2 is a flow chart of file monitoring in the present invention;

FIG. 3 is a schematic diagram of a file parsing data flow according to the present invention;

FIG. 4 is a diagram of an instrument configuration in accordance with an embodiment of the present invention;

FIG. 5 is a graph of experimental data acquisition for a specific embodiment of the present invention.

Detailed Description

The invention is further elucidated with reference to the drawings and the embodiments.

The analysis template configuration module performs analysis configuration according to the content format of the PDF file to be analyzed, and the specific analysis configuration modes are as follows:

(1) according to key-value pairs in the PDF file content: the analysis template configuration module provides a configuration interface on which a frame selection tool is used to select one value as the front key and then another value as the rear key; the content between the two keys is the content to be analyzed. Some content to be analyzed may be at the beginning or the end of the entire PDF file, in which case only the front key or the rear key needs to be configured;

(2) according to the coordinates of the content within the PDF file, the coordinates comprising the page number, the scaling factor, the abscissa, the ordinate, the width and the height;

(3) according to the type of regular tables in the PDF file, the table types being divided into bordered and borderless;

(4) according to irregular tables in the PDF file.

The user selects and configures a specific analysis template according to actual requirements; in general, a number of commonly used analysis templates are pre-configured on the server.

The PDF file acquisition module acquires a PDF file from a network path on the server, stores it on the server, calculates the MD5 of the acquired file and verifies the file. Specifically, the module monitors a local or network path on the server in real time; when a newly added PDF file is detected in that path, the file is copied to the system's to-be-analyzed file directory. Feature analysis is then performed on the content of the PDF file to obtain its basic information, which comprises the device, the sample and the matched analysis template, and a processing log is recorded and stored on the server. The feature analysis is configured in the analysis template configuration module: one or more feature values can be configured, each with a weight (the weights total 100). When a PDF file is analyzed, its matching degree is calculated by matching the file against each configured feature value and adding the weights of the matched feature values; the template with the highest score is the best analysis template. The feature analysis finally returns the device type and the set of analysis templates.
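The monitoring and copying step can be approximated by periodically scanning the watched path. Note that this sketch substitutes simple polling for the real-time monitoring described above, tracks already-seen files by name only, and uses hypothetical names (`collect_new_pdfs`, `watch_dir`, `pending_dir`) that do not appear in the patent.

```python
import shutil
from pathlib import Path


def collect_new_pdfs(watch_dir: Path, pending_dir: Path,
                     seen: set[str]) -> list[Path]:
    """Copy PDFs not seen before from watch_dir into pending_dir.

    One polling pass; a caller would invoke this in a loop with a
    delay to approximate real-time monitoring (an assumption).
    """
    pending_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    # The "*.pdf" pattern is case-sensitive on most non-Windows systems.
    for src in sorted(watch_dir.glob("*.pdf")):
        if src.name in seen:
            continue
        dst = pending_dir / src.name
        shutil.copy2(src, dst)  # copy2 also preserves file timestamps
        seen.add(src.name)
        copied.append(dst)
    return copied
```

Each returned path would then go through MD5 recording, feature analysis and logging as described in the paragraph above.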

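The PDF file data analysis method using the above PDF file data analysis system comprises the following steps:

step one: the analysis template configuration module performs analysis configuration according to the content format of the PDF file to be analyzed, obtaining a configured analysis template; the specific analysis configuration modes are as follows:

(1) according to key-value pairs in the PDF file content;

(2) according to the coordinates of the content within the PDF file, the coordinates comprising the page number, the scaling factor, the abscissa, the ordinate, the width and the height;

(3) according to the type of regular tables in the PDF file, the table types being divided into bordered and borderless;

(4) according to irregular tables in the PDF file;

step two: PDF files on a network path are acquired through the PDF file acquisition module and stored on the server, the MD5 of each acquired PDF file is calculated, and the file is verified;

step three: the PDF file analysis module analyzes the PDF file according to the configured analysis template.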

In the invention, users define analysis templates according to the PDF format, and feature analysis of configured PDFs is supported, so that quick-positioning analysis templates can be defined for the various PDF content formats that a single device exports. The system automatically acquires files by file path, records and stores them, and prevents tampering. File content is automatically extracted and stored according to the analysis template, which reduces the workload and errors of manual data entry and ensures the correctness of experimental data.

The overall design idea is to automatically analyze the content of PDF files exported by laboratory instruments in the pharmaceutical industry, store the original data and prevent tampering. Any laboratory execution system built according to this design can solve both the problem of analyzing the content of PDF files exported by instruments and the problem of recording and auditing experimental data.

Although the preferred embodiments of the present invention have been described in detail, the invention is not limited to the details of the foregoing embodiments; various equivalent changes (such as in number, shape or position) may be made to the technical solution within the technical spirit of the invention, and all such equivalents fall within the scope of protection of the invention.

Claims

1. A PDF file data analysis system, characterized in that: the system comprises an analysis template configuration module, a PDF file acquisition module, a PDF file analysis module and a server;

the analysis template configuration module performs analysis configuration according to the content format of the PDF file to be analyzed; the specific analysis configuration mode comprises the following steps:

(1) according to key-value pairs in the PDF file content: the analysis template configuration module provides a configuration interface on which a frame selection tool is used to select one value as the front key and then another value as the rear key, the content between the two keys being the content to be analyzed;

(2) according to the coordinates of the content within the PDF file, the coordinates comprising the page number, the scaling factor, the abscissa, the ordinate, the width and the height;

(3) according to the type of regular tables in the PDF file, the table types being divided into bordered and borderless;

(4) according to irregular tables in the PDF file;

the PDF file acquisition module acquires a PDF file according to a network path on the server, stores the PDF file into the server, calculates the MD5 of the acquired PDF file and verifies the PDF file;

and the PDF file analysis module analyzes the PDF file according to the configuration analysis template.
2. The PDF file data analysis system according to claim 1, wherein: the PDF file acquisition module monitors a local or network path on the server in real time, and when a newly added PDF file is detected in the local or network path, the corresponding PDF file is copied to the system's to-be-analyzed file directory; feature analysis is then performed on the content of the PDF file to obtain basic information of the file, the basic information comprising the device, the sample and the matched analysis template;

the feature analysis is configured in the analysis template configuration module as one or more feature values, each feature having a corresponding weight; the matching degree of a PDF file requiring feature analysis is calculated by matching the PDF file against each configured feature value and adding the weights of the matched feature values, and the template with the highest calculated score is the best analysis template.
3. The PDF file data analysis system according to claim 1, wherein: if the content to be analyzed is at the beginning or the end of the whole PDF file, the analysis template configuration module needs to configure only a front key or a rear key.
4. The PDF file data analysis system according to claim 1, wherein: a user selects and configures a specific analysis template according to actual requirements, and a plurality of commonly used analysis templates are pre-configured on the server.
5. A PDF file data analysis method applying the PDF file data analysis system of any one of claims 1-4, characterized in that: the method comprises the following steps:

step one: the analysis template configuration module performs analysis configuration according to the content format of the PDF file to be analyzed, obtaining a configured analysis template; the specific analysis configuration modes are as follows:

(1) according to key-value pairs in the PDF file content;

(2) according to the coordinates of the content within the PDF file, the coordinates comprising the page number, the scaling factor, the abscissa, the ordinate, the width and the height;

(3) according to the type of regular tables in the PDF file, the table types being divided into bordered and borderless;

(4) according to irregular tables in the PDF file;

step two: PDF files on a network path are acquired through the PDF file acquisition module and stored on the server, the MD5 of each acquired PDF file is calculated, and the file is verified;

step three: the PDF file analysis module analyzes the PDF file according to the configured analysis template.