CN117473980A - Structured analysis method of portable document format file and related products

Info

Publication number: CN117473980A
Application number: CN202311498326.7A
Authority: CN
Inventors: 唐小利; 李晓瑛; 刘宇炀; 杨雪梅; 王超
Original assignee: Institute of Medical Information CAMS
Current assignee: Institute of Medical Information CAMS
Priority date: 2023-11-10
Filing date: 2023-11-10
Publication date: 2024-01-30

Abstract

The application provides a structured parsing method for a portable document format file and related products, which can be applied to the technical field of data processing. The method comprises the following steps: extracting metadata information, content information and page size information corresponding to the portable document format file; determining the type areas of a preset picture format file corresponding to a page of the portable document format file by using a trained file intelligent analysis model; based on the page size information, text coordinates and picture coordinates, matching texts and pictures with the type areas by using the trained file intelligent analysis model to obtain first structured data; performing association mapping between references and citation sentences by using regular expressions and the text coordinates to obtain second structured data; and associating and outputting the metadata information and the second structured data. Because the texts and pictures extracted from the file itself are matched with the type areas by the trained file intelligent analysis model, the parsing accuracy is improved.

Description

Structured analysis method of portable document format file and related products

Technical Field

The present disclosure relates to the field of data processing technologies, and in particular, to a method for structured parsing of a portable document format file and related products.

Background

In order to present the portable document format file to the user in a more browsable manner, a structured parsing process is required for the portable document format file.

The existing structured analysis method generally obtains the content text information of the file by analyzing and identifying the image layout of the file. However, the method has a certain error rate and cannot accurately restore the text content of the file, so that the problem of insufficient analysis accuracy is caused.

Therefore, how to improve the parsing accuracy is a problem that the skilled person needs to solve.

Disclosure of Invention

Based on the problems, the application provides a structured analysis method of a portable document format file and related products, and based on page size information, text coordinates and picture coordinates, a trained intelligent file analysis model is utilized to carry out matching association on texts and pictures and type areas, so that the problem of insufficient analysis accuracy in the prior art is solved.

In a first aspect, the present application provides a method for structured parsing of a portable document format file, including:

analyzing a portable document format file, and extracting metadata information, content information and page size information corresponding to the portable document format file; the content information includes: text, picture, text coordinates, and picture coordinates;

performing layout analysis on a preset picture format file corresponding to a page of the portable document format file by using a trained file intelligent analysis model, and determining a type area of the preset picture format file;

based on the page size information, the text coordinates and the picture coordinates, matching the text and the picture with the type region by using the trained file intelligent analysis model to obtain first structured data;

based on the first structured data, performing association mapping on the reference document and the reference sentence by using a regular expression and the text coordinates to obtain second structured data;

and correlating and outputting the metadata information and the second structured data to realize the structured analysis of the portable document format file.

Optionally, the parsing the portable document format file, extracting metadata information, content information and page size information corresponding to the portable document format file, includes:

analyzing header file information of a portable document format file, and extracting metadata information corresponding to the portable document format file;

and analyzing the portable document format file based on an open source library, and extracting content information and page size information corresponding to the portable document format file.

Optionally, the performing layout analysis on the preset picture format file corresponding to the page of the portable document format file by using the trained intelligent file analysis model, before determining the type area of the preset picture format file, further includes:

and converting the page of the portable document format file into a preset picture format file.

constructing a sample data set;

performing layout marking on the sample data in the sample data set by using a deep learning image marking tool to obtain a model tuning training set;

and training the basic file intelligent analysis model by using the model tuning training set to obtain a trained file intelligent analysis model.

Optionally, the constructing a sample dataset includes:

acquiring a training file;

performing page-to-picture processing on the training file to obtain a training picture;

carrying out gray level conversion processing, image smoothing processing, edge detection processing and binarization preprocessing on the training picture to obtain sample data;

a sample data set is constructed using the sample data.

Optionally, the training the basic intelligent file analysis model by using the model tuning training set to obtain a trained intelligent file analysis model includes:

and training the basic file intelligent analysis model by using the model tuning training set and the intelligent document multi-modal pre-training model through the Baidu PaddlePaddle deep learning framework to obtain a trained file intelligent analysis model.

Optionally, the performing layout analysis on the preset picture format file corresponding to the page of the portable document format file by using the trained intelligent file analysis model, and determining the type area of the preset picture format file includes:

and carrying out layout analysis on a preset picture format file corresponding to a page of the portable document format file by using a trained file intelligent analysis model, and determining a header area, a footer area, a title area, an author unit area, a chapter area, a paragraph area, a picture text area, a table text area, a formula area and a reference document area of the preset picture format file.

In a second aspect, the present application provides a structured parsing apparatus for a portable document format file, including:

the analysis module is used for analyzing the portable document format file and extracting metadata information, content information and page size information corresponding to the portable document format file; the content information includes: text, picture, text coordinates, and picture coordinates;

the analysis module is used for carrying out layout analysis on a preset picture format file corresponding to a page of the portable document format file by using the trained intelligent file analysis model, and determining a type area of the preset picture format file;

the matching module is used for matching the text, the picture and the type area by utilizing the trained file intelligent analysis model based on the page size information, the text coordinates and the picture coordinates to obtain first structured data;

the mapping module is used for carrying out association mapping on the reference document and the reference sentence by utilizing a regular expression and the text coordinates based on the first structured data to obtain second structured data;

and the output module is used for correlating and outputting the metadata information and the second structured data to realize the structured analysis of the portable document format file.

In a third aspect, the present application provides a structured parsing apparatus for a portable document format file, comprising:

a memory for storing a computer program;

a processor for implementing the steps of the method for structured parsing of a portable document format file as described in any one of the above when executing the computer program.

In a fourth aspect, the present application provides a readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a method of structured parsing of a portable document format file as described in any of the preceding claims.

From the above technical solution, compared with the prior art, the present application has the following advantages:

the method and the device analyze the portable document format file and extract metadata information, content information and page size information corresponding to the portable document format file. Wherein the content information includes: text, picture, text coordinates, and picture coordinates. And performing layout analysis on the preset picture format file corresponding to the page of the portable document format file by using the trained file intelligent analysis model, determining a type area of the preset picture format file, and matching texts and pictures with the type area by using the trained file intelligent analysis model based on page size information, text coordinates and picture coordinates to obtain first structured data. And finally, based on the first structured data, carrying out association mapping on the reference document and the reference sentence by using the regular expression and the text coordinate to obtain second structured data, and carrying out association and output on the metadata information and the second structured data to realize the structural analysis on the portable document format file. Therefore, based on page size information, text coordinates and picture coordinates, the text and the picture are matched and associated with the type area by using the trained file intelligent analysis model, so that the analysis accuracy is improved.

Drawings

FIG. 1 is a flow chart of a method for structured parsing of a portable document format file provided herein;

fig. 2 is a schematic structural diagram of a device for structuring and parsing a portable document format file provided in the present application.

Detailed Description

As described above, existing structured parsing methods suffer from insufficient parsing accuracy. Specifically, they generally locate the areas of the portable document format file, such as the header, the footer, the title, the paragraphs and the pictures, by analyzing and recognizing the image layout of the file, and then obtain the text of each area through text recognition technology. Although current character recognition technology is mature, it cannot guarantee 100% accuracy, which leads to insufficient parsing accuracy.

In order to solve the above problems, the present application provides a method for structured parsing of a portable document format file, including: firstly, analyzing the portable document format file, and extracting metadata information, content information and page size information corresponding to the portable document format file. Wherein the content information includes: text, picture, text coordinates, and picture coordinates. And performing layout analysis on the preset picture format file corresponding to the page of the portable document format file by using the trained file intelligent analysis model, determining a type area of the preset picture format file, and matching texts and pictures with the type area by using the trained file intelligent analysis model based on page size information, text coordinates and picture coordinates to obtain first structured data. And finally, based on the first structured data, carrying out association mapping on the reference document and the reference sentence by using the regular expression and the text coordinate to obtain second structured data, and carrying out association and output on the metadata information and the second structured data to realize the structural analysis on the portable document format file.

Therefore, based on page size information, text coordinates and picture coordinates, the text and the picture are matched and associated with the type area by using the trained file intelligent analysis model, so that the analysis accuracy is improved.

It should be noted that the method for structured parsing of the portable document format file and the related products provided by the application can be applied to the technical field of data processing. The foregoing is merely an example, and is not intended to limit the application fields of the method for structured parsing of a portable document format file and related products provided in the present application.

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.

Fig. 1 is a flowchart of a method for structured parsing of a portable document format file provided in the present application. Referring to fig. 1, a method for structured parsing of a portable document format file provided in the present application may include:

S101: analyzing a portable document format file, and extracting metadata information, content information and page size information corresponding to the portable document format file; the content information includes: text, picture, text coordinates, and picture coordinates.

In practical application, the device for structured parsing of a portable document format file first obtains the portable document format (Portable Document Format, PDF) file that needs to be parsed. It then parses the PDF file to obtain the metadata information of the PDF file. In addition, the device parses the stream data of the PDF file to obtain the content information and the page size information of the PDF file. The content information comprises text, pictures, text coordinates and picture coordinates, and the page size information refers to the width and the height of a PDF file page.
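As an illustration of what S101 yields, the following minimal sketch models the extracted information as plain data classes. All class and field names, and the `(x0, y0, x1, y1)` bounding-box layout, are assumptions made for illustration; the application does not prescribe a data layout or name the parsing library.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Bounding box as (x0, y0, x1, y1) in PDF page units (assumed layout).
BBox = Tuple[float, float, float, float]


@dataclass
class TextSpan:
    text: str
    bbox: BBox   # text coordinates
    page: int    # zero-based page index


@dataclass
class Picture:
    name: str
    bbox: BBox   # picture coordinates
    page: int


@dataclass
class PageSize:
    width: float
    height: float


@dataclass
class ParsedPdf:
    """Hypothetical container for the three outputs of S101."""
    metadata: Dict[str, str] = field(default_factory=dict)
    texts: List[TextSpan] = field(default_factory=list)
    pictures: List[Picture] = field(default_factory=list)
    page_sizes: List[PageSize] = field(default_factory=list)


# Hand-written sample values standing in for real parser output.
parsed = ParsedPdf(
    metadata={"title": "Example article", "author": "A. Author"},
    texts=[TextSpan("Introduction", (72.0, 90.0, 200.0, 104.0), page=0)],
    pictures=[Picture("figure-1", (72.0, 300.0, 400.0, 520.0), page=0)],
    page_sizes=[PageSize(width=595.0, height=842.0)],
)
```

Keeping coordinates next to each text span and picture is what later allows them to be matched against the type areas found by layout analysis.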

In addition, since the manners of parsing the portable document format file are not the same, the present application can be described in terms of one possible parsing manner.

In one case, it is directed to how to parse the portable document format file. Correspondingly, the parsing the portable document format file, extracting metadata information, content information and page size information corresponding to the portable document format file, includes:

In practical application, the device for analyzing the structure of the portable document format file can analyze the header file information of the PDF file to obtain the corresponding metadata information. The metadata of the PDF file mainly includes: file type, file size, file title, file theme, file related keywords, file author, file creation date, last file modification date, name of the program that originally created the file, name of the program that converted the file into PDF, and the like. In addition, the device for analyzing the structure of the portable document format file also needs to analyze the core content of the PDF file so as to obtain the corresponding content information and page size information. It should be noted that, the structural analysis device of the portable document format file uses an open-source lightweight PDF analysis library to analyze the core content of the PDF file through a corresponding interface.

S102: and performing layout analysis on the preset picture format file corresponding to the page of the portable document format file by using the trained intelligent file analysis model, and determining the type area of the preset picture format file.

In practical application, the structured parsing device of the portable document format file can utilize the trained intelligent file analysis model to perform layout analysis on the preset picture format file corresponding to the page of the PDF file, and determine the type area of the preset picture format file. Note that the preset picture format file is a PNG (Portable Network Graphics) image file.

In addition, since the preprocessing modes before layout analysis are different, the present application can be described with respect to one possible preprocessing mode.

In one case, it is directed to how to preprocess a portable document format file. Correspondingly, the performing layout analysis on the preset picture format file corresponding to the page of the portable document format file by using the trained intelligent file analysis model, before determining the type area of the preset picture format file, further comprises:

In practical application, since the intelligent file analysis model can only perform layout analysis on a picture, each page of the PDF file needs to be converted into a file with a preset picture format, that is, into a PNG image file.

In addition, since the manner of determining the type area of the preset picture format file is not the same, the present application can be described with respect to one possible determination manner.

In one case, it is directed to how to determine the type area of the preset picture format file. Correspondingly, S102: performing layout analysis on a preset picture format file corresponding to a page of the portable document format file by using a trained file intelligent analysis model, and determining a type area of the preset picture format file specifically may include:

In practical application, the application performs layout analysis on the preset picture format file corresponding to the page of the portable document format file by using the trained file intelligent analysis model, so that thirteen types of areas can be determined, specifically including a header area, a footer area, a title area, an author unit area, a chapter area, a paragraph area, a picture text area, a table text area, a formula area and a reference document area.

S103: and matching the text and the picture with the type region by using the trained file intelligent analysis model based on the page size information, the text coordinates and the picture coordinates to obtain first structured data.

In practical application, the page size information, text coordinates and picture coordinates obtained by the analysis are used for matching and associating with layout area coordinates processed by the file intelligent analysis model. Specifically, the device for structuring and parsing the portable document format file needs to correspond the text coordinates and the picture coordinates obtained by parsing to the coordinates of the type areas, so that the text and the picture are correspondingly filled in different type areas. For example, if the structured parsing device of the portable document format file determines that the header area and the picture area exist in the current PNG image file by using the trained file intelligent analysis model, wherein the coordinates of the header area are (3, 3), the coordinates of the picture area are (3, 2), the coordinates of the text a are (3, 3), and the coordinates of the picture B are (3, 2), the structured parsing device of the portable document format file fills the text a into the header area and fills the picture B into the picture area by using the trained file intelligent analysis model, thereby realizing the matching of the text and the picture with the type area and obtaining the first structured data. It should be noted that the resolution of the page of the PDF file and the PNG image file needs to be unified based on the page size information before matching association is performed.
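The coordinate matching described above can be sketched as follows. This is only a geometric illustration under stated assumptions: boxes are `(x0, y0, x1, y1)` with a top-left origin, an item belongs to the region that contains its centre point, and the function names `scale_bbox`, `find_region` and `match_items` are hypothetical. The application performs this matching with the trained file intelligent analysis model, which this rule-based sketch does not reproduce.

```python
from typing import Dict, List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), top-left origin (assumed)


def scale_bbox(bbox: BBox, page_size: Tuple[float, float],
               image_size: Tuple[int, int]) -> BBox:
    """Convert a box from PDF page units to PNG pixel units."""
    sx = image_size[0] / page_size[0]
    sy = image_size[1] / page_size[1]
    x0, y0, x1, y1 = bbox
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)


def find_region(bbox: BBox, regions: List[Tuple[str, BBox]]) -> Optional[str]:
    """Return the type of the first region that contains the centre of bbox."""
    cx = (bbox[0] + bbox[2]) / 2
    cy = (bbox[1] + bbox[3]) / 2
    for region_type, (rx0, ry0, rx1, ry1) in regions:
        if rx0 <= cx <= rx1 and ry0 <= cy <= ry1:
            return region_type
    return None


def match_items(items: List[Tuple[str, BBox]], regions: List[Tuple[str, BBox]],
                page_size: Tuple[float, float],
                image_size: Tuple[int, int]) -> Dict[str, List[str]]:
    """Fill each text or picture into the type region it falls in."""
    matched: Dict[str, List[str]] = {}
    for name, bbox in items:
        region_type = find_region(scale_bbox(bbox, page_size, image_size), regions)
        if region_type is not None:
            matched.setdefault(region_type, []).append(name)
    return matched


page_size = (595.0, 842.0)   # PDF page width and height
image_size = (1190, 1684)    # PNG rendered at twice the page size
regions = [("header", (0, 0, 1190, 100)), ("picture", (100, 600, 900, 1100))]
items = [("text A", (72.0, 20.0, 300.0, 40.0)),
         ("picture B", (72.0, 320.0, 400.0, 520.0))]
first_structured = match_items(items, regions, page_size, image_size)
```

Scaling by the ratio of image size to page size is one way to unify the two resolutions before matching, as the paragraph above requires.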

In addition, since the ways of training the file intelligent analysis model are different, the present application can be described with respect to one possible training way.

In one case, the question is how to train the file intelligent analysis model. Correspondingly, before the performing layout analysis on the preset picture format file corresponding to the page of the portable document format file by using the trained file intelligent analysis model and determining the type area of the preset picture format file, the method further comprises:

constructing a sample data set;

In practical application, the structured parsing device of the portable document format file constructs a sample data set containing a plurality of sample data, and labels the sample data in the sample data set with labelme (a deep learning image labeling tool), so as to obtain a labeled small sample data set, namely the model tuning training set. The label types include header, footer, title, author unit, chapter, paragraph, picture text, table text, formula, and reference. The basic PP-StructureV2 (file intelligent analysis) model is then trained with the labeled small sample data set through PaddlePaddle (Baidu's deep learning framework), so as to obtain the trained file intelligent analysis model.

In addition, since the manner in which the sample data sets are constructed is not the same, the present application may be described in terms of one possible manner of construction.

In one case, it is directed to how to construct the sample dataset. Accordingly, the constructing a sample dataset includes:

acquiring a training file;

a sample data set is constructed using the sample data.

In practical application, the structured parsing device of the portable document format file randomly selects 5000 sample files from a PDF electronic file database to serve as training files, of which 4000 sample files serve as training samples and 1000 sample files serve as verification samples. Because the file intelligent analysis model performs layout analysis on page images, the 5000 sample files need to be converted into pictures to serve as training pictures, 25000 pictures in total, of which 20000 pictures form the training set and 5000 pictures form the verification set. Finally, gray level conversion processing, image smoothing processing, edge detection processing and binarization preprocessing are carried out on the training pictures to obtain sample data, and the sample data set is constructed from the sample data.
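The four preprocessing steps can be illustrated with a small pure-Python sketch that works on nested lists of pixels. The specific operators are assumptions chosen for brevity: a luminance-weighted gray conversion, a 3x3 mean filter, a forward-difference gradient as a simple stand-in for an edge detector, and a fixed threshold. The application does not state which operators, parameters or image library are used, nor that the steps are chained in exactly this way.

```python
from typing import List, Tuple

Gray = List[List[float]]


def to_gray(rgb: List[List[Tuple[int, int, int]]]) -> Gray:
    """Gray level conversion with luminance weights."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in rgb]


def smooth(img: Gray) -> Gray:
    """3x3 mean filter; border pixels average over the neighbours that exist."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [img[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def detect_edges(img: Gray) -> Gray:
    """Sum of absolute forward differences in x and y."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            gx = img[y][min(x + 1, w - 1)] - img[y][x]
            gy = img[min(y + 1, h - 1)][x] - img[y][x]
            row.append(abs(gx) + abs(gy))
        out.append(row)
    return out


def binarize(img: Gray, threshold: float) -> List[List[int]]:
    """Map each pixel to 1 if it reaches the threshold, else 0."""
    return [[1 if v >= threshold else 0 for v in row] for row in img]


def preprocess(rgb: List[List[Tuple[int, int, int]]], threshold: float = 50.0) -> List[List[int]]:
    """Chain the four steps in the order listed in the text."""
    return binarize(detect_edges(smooth(to_gray(rgb))), threshold)


# A 4x4 test picture: left half black, right half white.
black, white = (0, 0, 0), (255, 255, 255)
picture = [[black, black, white, white] for _ in range(4)]
sample = preprocess(picture)
```

In this toy picture the smoothing step spreads the vertical black-to-white boundary over several columns, so the binarized edge map marks more than one column per row.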

In addition, since the ways of training the file intelligent analysis model are different, the present application can be described in terms of one possible training way.

In one case, the question is how to train the file intelligent analysis model. Correspondingly, the training of the basic file intelligent analysis model by using the model tuning training set to obtain a trained file intelligent analysis model comprises the following steps:

In practical application, the basic PP-StructureV2 (file intelligent analysis) model is trained with the model tuning training set, based on LayoutLMv3 (an intelligent document multi-modal pre-training model) from Microsoft Research Asia, through PaddlePaddle (Baidu's deep learning framework), so that a trained file intelligent analysis model is obtained. PP-StructureV2 is a file intelligent analysis model developed in-house by the Baidu team, and aims to help developers better complete document understanding tasks such as layout analysis and table recognition. When PP-StructureV2 is applied, an image correction module first judges the orientation of the whole image and corrects it, after which the layout analysis task and the key information extraction task can be completed. In the layout analysis task, the image first passes through a layout analysis model, which divides it into different types of areas such as text, tables and images, and these areas are then recognized separately. For example, a table area is sent to a table recognition module for structural recognition, a text area is sent to an OCR (Optical Character Recognition) engine for text recognition, and finally a layout restoration module restores the result to a PDF file consistent with the original image layout. In the key information extraction task, the OCR engine first extracts the text content, a semantic entity recognition module then obtains the semantic entities in the image, and finally a relation extraction module obtains the correspondences among the semantic entities, so that the needed key information is extracted.
In addition, LayoutLMv3 is an intelligent document multi-modal pre-training model released by Microsoft Research Asia. It draws on masked language modeling, a self-supervised pre-training method from natural language processing research, which learns representations carrying contextual semantics by randomly masking a certain proportion of the words in a text and reconstructing the masked words from the context.
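The masking step of masked language modeling can be illustrated with a toy sketch. It only shows how a proportion of tokens is hidden and kept as reconstruction targets; it is not the LayoutLMv3 implementation, and the `[MASK]` placeholder, the function name `mask_tokens` and the 30% ratio are assumptions chosen for the example.

```python
import random
from typing import Dict, List, Tuple

MASK = "[MASK]"  # placeholder token (assumed)


def mask_tokens(tokens: List[str], ratio: float,
                seed: int) -> Tuple[List[str], Dict[int, str]]:
    """Hide a proportion of tokens and return them as reconstruction targets."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets: Dict[int, str] = {}
    for pos in positions:
        targets[pos] = tokens[pos]   # what a model would have to predict
        masked[pos] = MASK           # what a model would see instead
    return masked, targets


tokens = "the layout model reads text and image patches together".split()
masked, targets = mask_tokens(tokens, ratio=0.3, seed=0)
```

During pre-training, a model receives `masked` as input and is scored on how well it recovers the words stored in `targets` from the surrounding context.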

S104: and based on the first structured data, performing association mapping on the reference document and the reference sentence by using a regular expression and the text coordinates to obtain second structured data.

In practical application, after the structured parsing device of the portable document format file fills the text and the pictures into the corresponding type areas, it continues with the association mapping between references and citation sentences. Specifically, based on the first structured data, the association mapping between references and citation sentences is performed through regular expressions and the text coordinates. A citation sentence usually points to its references through numeric indices enclosed in parentheses or square brackets, for example (1), (1, 2, 5), (1-3, 5), [1], [1,2,5] and [1-3,5], or through numeric superscripts. For citation sentences of the bracketed-number type, the reference indices can be obtained through regular expressions and then mapped to the references to obtain the second structured data; for citation sentences of the numeric-superscript type, the superscript can be identified from the text coordinates and the preceding and following characters, so that the reference index is obtained and mapped to the reference to obtain the second structured data. In addition, journal information and year information of the PDF file can be obtained by applying regular expressions to the determined header area and footer area.
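For the bracketed-number case, the index extraction can be sketched with Python's `re` module. The pattern and the helper names `expand_indices` and `map_citations` are illustrative assumptions, not the regular expression used in the application. The sketch does not handle superscript citations, and it guards against false positives such as a bracketed year only by discarding indices that are absent from the reference list.

```python
import re
from typing import Dict, List

# Matches (1), (1, 2, 5), (1-3, 5), [1], [1,2,5] and [1-3,5].
CITATION = re.compile(
    r"[(\[]\s*(\d+(?:\s*-\s*\d+)?(?:\s*,\s*\d+(?:\s*-\s*\d+)?)*)\s*[)\]]"
)


def expand_indices(body: str) -> List[int]:
    """Turn '1-3, 5' into [1, 2, 3, 5]."""
    indices: List[int] = []
    for part in body.split(","):
        bounds = [int(n) for n in part.split("-")]
        if len(bounds) == 2:
            indices.extend(range(bounds[0], bounds[1] + 1))
        else:
            indices.append(bounds[0])
    return indices


def map_citations(sentences: List[str], references: Dict[int, str]) -> List[dict]:
    """Associate each citation sentence with the references it cites."""
    mapped = []
    for sentence in sentences:
        cited: List[int] = []
        for match in CITATION.finditer(sentence):
            for idx in expand_indices(match.group(1)):
                # Keep only indices that exist in the reference list.
                if idx in references and idx not in cited:
                    cited.append(idx)
        if cited:
            mapped.append({"sentence": sentence, "reference_indices": cited})
    return mapped


references = {1: "Reference one", 2: "Reference two",
              3: "Reference three", 5: "Reference five"}
sentences = [
    "Layout analysis has been studied widely [1-3,5].",
    "Earlier work (2) relied on OCR only.",
    "This sentence cites nothing.",
]
second_structured = map_citations(sentences, references)
```

Expanding ranges such as `1-3` before the lookup means every cited reference receives its own association, not only the two endpoints.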

S105: and correlating and outputting the metadata information and the second structured data to realize the structured analysis of the portable document format file.

In practical application, the structural analysis device of the portable document format file matches the text and the picture with the type area, then carries out association mapping on the reference document and the reference sentence, and associates and outputs the obtained second structural data with the metadata information serving as auxiliary information, thereby realizing structural analysis and structural output of the portable document format file. It should be noted that, the method for structured parsing of the portable document format file provided by the present application may perform structured parsing on a dual-layer PDF file.
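The association and output of S105 could, for example, take the form of a single JSON document. The following is a minimal sketch under that assumption; the application does not specify an output format, and the key names and the function `build_output` are hypothetical.

```python
import json


def build_output(metadata: dict, second_structured: dict) -> str:
    """Attach the metadata as auxiliary information and serialize the result."""
    document = {"metadata": metadata, "content": second_structured}
    # ensure_ascii=False keeps non-ASCII titles and author names readable.
    return json.dumps(document, ensure_ascii=False, indent=2)


metadata = {"title": "Example article", "author": "A. Author"}
second_structured = {
    "title": ["Example article"],
    "paragraph": [{"sentence": "Earlier work [1] relied on OCR only.",
                   "reference_indices": [1]}],
    "reference": {"1": "Reference one"},
}
output = build_output(metadata, second_structured)
```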

In summary, the present application first analyzes the portable document format file, and extracts metadata information, content information, and page size information corresponding to the portable document format file. Wherein the content information includes: text, picture, text coordinates, and picture coordinates. And performing layout analysis on the preset picture format file corresponding to the page of the portable document format file by using the trained file intelligent analysis model, determining a type area of the preset picture format file, and matching texts and pictures with the type area by using the trained file intelligent analysis model based on page size information, text coordinates and picture coordinates to obtain first structured data. And finally, based on the first structured data, carrying out association mapping on the reference document and the reference sentence by using the regular expression and the text coordinate to obtain second structured data, and carrying out association and output on the metadata information and the second structured data to realize the structural analysis on the portable document format file. Therefore, based on page size information, text coordinates and picture coordinates, the text and the picture are matched and associated with the type area by using the trained file intelligent analysis model, so that the analysis accuracy is improved.

Based on the method for structured analysis of the portable document format file provided by the embodiment, the application also provides a device for structured analysis of the portable document format file. The structure parsing apparatus of the portable document format file will be described below with reference to the embodiments and drawings, respectively.

Fig. 2 is a schematic structural diagram of a device for structuring and parsing a portable document format file provided in the present application. Referring to fig. 2, a device 200 for structured parsing of a portable document format file according to an embodiment of the present application includes:

the parsing module 201 is configured to parse the portable document format file, and extract metadata information, content information and page size information corresponding to the portable document format file; the content information includes: text, picture, text coordinates, and picture coordinates;

the analysis module 202 is configured to perform layout analysis on a preset picture format file corresponding to a page of the portable document format file by using a trained file intelligent analysis model, and determine a type area of the preset picture format file;

the matching module 203 is configured to match the text and the picture with the type region by using the trained file intelligent analysis model based on the page size information, the text coordinates and the picture coordinates, so as to obtain first structured data;

the mapping module 204 is configured to perform association mapping on the reference document and the reference sentence by using a regular expression and the text coordinate based on the first structured data, so as to obtain second structured data;

and the output module 205 is configured to correlate and output the metadata information and the second structured data, so as to implement structural analysis on the portable document format file.

As an embodiment, the parsing module 201 is specifically configured to:

parse the portable document format file based on an open source library, and extract the content information and page size information corresponding to the portable document format file;

the content information includes: text, picture, text coordinates, and picture coordinates.
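For illustration, the extracted content information and page size information could be held in records such as the following. This is a minimal sketch: the open source library is not named in this section, so the records are defined independently of any particular library, and all class and field names are hypothetical.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates


@dataclass
class TextBlock:
    text: str
    box: Box  # text coordinates


@dataclass
class PictureBlock:
    name: str  # hypothetical identifier for the extracted picture
    box: Box   # picture coordinates


@dataclass
class PageContent:
    page_number: int
    width: float   # page size information
    height: float
    texts: List[TextBlock] = field(default_factory=list)
    pictures: List[PictureBlock] = field(default_factory=list)


# Example record for one page; the values are made up for illustration.
page = PageContent(page_number=1, width=595.0, height=842.0)
page.texts.append(TextBlock("Abstract", (72.0, 90.0, 140.0, 104.0)))
page.pictures.append(PictureBlock("figure-1", (72.0, 300.0, 520.0, 560.0)))
record = asdict(page)  # plain dict, convenient for later association and output
```

Storing the page size alongside the coordinates keeps everything needed for the later matching step in one record per page.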

As an embodiment, the above device 200 for structured parsing of a portable document format file further includes a conversion module;

the conversion module is configured to convert a page of the portable document format file into a preset picture format file.
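When a page is converted into a picture, the pixel dimensions of the picture follow from the page size and the rendering resolution. The sketch below assumes that the page size is reported in PDF points (1/72 inch, the default user space unit) and that pixel dimensions are rounded to the nearest integer; the resolution value and the function name `page_image_size` are hypothetical, and actual renderers may round differently.

```python
from typing import Tuple


def page_image_size(width_pt: float, height_pt: float,
                    dpi: float = 144.0) -> Tuple[int, int]:
    """Pixel size of a page rendered at `dpi`, for a page size given in points."""
    scale = dpi / 72.0  # pixels per point
    return (round(width_pt * scale), round(height_pt * scale))


def points_to_pixels(value_pt: float, dpi: float = 144.0) -> float:
    """Convert a single coordinate from points to pixels at `dpi`."""
    return value_pt * dpi / 72.0
```

The same scale factor relates the text coordinates and picture coordinates extracted from the file to positions on the rendered picture.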

As an embodiment, the above device 200 for structured parsing of a portable document format file further includes a construction module, a labeling module and a training module;

the construction module is configured to construct a sample data set;

the labeling module is configured to label the layout of the sample data in the sample data set by using a deep learning image labeling tool, so as to obtain a model tuning training set;

and the training module is configured to train a basic file intelligent analysis model by using the model tuning training set, so as to obtain the trained file intelligent analysis model.

As an embodiment, the above construction module is specifically configured to:

acquire a training file;

construct the sample data set using the sample data.

As an embodiment, the training module is specifically configured to train the file intelligent analysis model:

As an embodiment, the analysis module 202 is specifically configured to perform layout analysis on the preset picture format file:

In addition, the present application also provides a device for structured parsing of a portable document format file, comprising: a memory for storing a computer program; and a processor for implementing, when executing the computer program, the steps of the method for structured parsing of a portable document format file described in any one of the above embodiments.

In addition, the present application further provides a readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the method for structured parsing of a portable document format file described in any one of the above embodiments.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for structured parsing of a portable document format file, the method comprising:

2. The method of claim 1, wherein parsing the portable document format file to extract metadata information, content information, and page size information corresponding to the portable document format file comprises:

3. The method according to claim 1, wherein before performing layout analysis on the preset picture format file corresponding to the page of the portable document format file using the trained file intelligent analysis model and determining the type area of the preset picture format file, the method further comprises:

4. The method according to claim 1, wherein before performing layout analysis on the preset picture format file corresponding to the page of the portable document format file using the trained file intelligent analysis model and determining the type area of the preset picture format file, the method further comprises:

constructing a sample data set;

5. The method of claim 4, wherein constructing the sample dataset comprises:

acquiring a training file;

constructing the sample data set using the sample data.

6. The method of claim 4, wherein training the basic file intelligent analysis model using the model tuning training set to obtain the trained file intelligent analysis model comprises:

7. The method according to claim 1, wherein performing layout analysis on the preset picture format file corresponding to the page of the portable document format file using the trained file intelligent analysis model and determining the type area of the preset picture format file comprises:

8. A structured document parsing apparatus for a portable document format file, comprising:

9. A structured parsing apparatus for a portable document format file, comprising:

a memory for storing a computer program;

a processor for implementing the steps of the structured parsing method of a portable document format file according to any one of claims 1 to 7 when executing said computer program.

10. A readable storage medium, wherein a computer program is stored on the readable storage medium, which when executed by a processor, implements the steps of the structured parsing method of a portable document format file according to any one of claims 1 to 7.