CN117473980A - Structured analysis method of portable document format file and related products - Google Patents

Structured analysis method of portable document format file and related products

Info

Publication number
CN117473980A
Authority
CN
China
Prior art keywords
file
format file
portable document
document format
picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311498326.7A
Other languages
Chinese (zh)
Inventor
唐小利
李晓瑛
刘宇炀
杨雪梅
王超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Medical Information CAMS
Original Assignee
Institute of Medical Information CAMS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Medical Information CAMS
Priority to CN202311498326.7A
Publication of CN117473980A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/14 Tree-structured documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The application provides a method for structured parsing of a portable document format file and related products, which can be applied to the technical field of data processing. The method comprises: extracting metadata information, content information and page size information corresponding to the portable document format file; determining, by using a trained file intelligent analysis model, a type area of a preset picture format file corresponding to a page of the portable document format file; matching texts and pictures with the type areas by using the trained file intelligent analysis model, based on the page size information, text coordinates and picture coordinates, to obtain first structured data; performing association mapping between references and citation sentences by using regular expressions and the text coordinates to obtain second structured data; and associating and outputting the metadata information and the second structured data. In this way, the texts and pictures are matched and associated with the type areas by using the trained file intelligent analysis model, which improves parsing accuracy.

Description

Structured analysis method of portable document format file and related products
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a method for structured parsing of a portable document format file and related products.
Background
In order to present the portable document format file to the user in a more browsable manner, a structured parsing process is required for the portable document format file.
The existing structured analysis method generally obtains the content text information of the file by analyzing and identifying the image layout of the file. However, the method has a certain error rate and cannot accurately restore the text content of the file, so that the problem of insufficient analysis accuracy is caused.
Therefore, how to improve parsing accuracy is a problem that those skilled in the art need to solve.
Disclosure of Invention
Based on the problems, the application provides a structured analysis method of a portable document format file and related products, and based on page size information, text coordinates and picture coordinates, a trained intelligent file analysis model is utilized to carry out matching association on texts and pictures and type areas, so that the problem of insufficient analysis accuracy in the prior art is solved.
In a first aspect, the present application provides a method for structured parsing of a portable document format file, including:
analyzing a portable document format file, and extracting metadata information, content information and page size information corresponding to the portable document format file; the content information includes: text, picture, text coordinates, and picture coordinates;
performing layout analysis on a preset picture format file corresponding to a page of the portable document format file by using a trained file intelligent analysis model, and determining a type area of the preset picture format file;
based on the page size information, the text coordinates and the picture coordinates, matching the text and the picture with the type region by using the trained file intelligent analysis model to obtain first structured data;
based on the first structured data, performing association mapping on the reference document and the reference sentence by using a regular expression and the text coordinates to obtain second structured data;
and correlating and outputting the metadata information and the second structured data to realize the structured analysis of the portable document format file.
Optionally, the parsing the portable document format file, extracting metadata information, content information and page size information corresponding to the portable document format file, includes:
analyzing header file information of a portable document format file, and extracting metadata information corresponding to the portable document format file;
and analyzing the portable document format file based on an open source library, and extracting content information and page size information corresponding to the portable document format file.
Optionally, the performing layout analysis on the preset picture format file corresponding to the page of the portable document format file by using the trained intelligent file analysis model, before determining the type area of the preset picture format file, further includes:
and converting the page of the portable document format file into a preset picture format file.
Optionally, the performing layout analysis on the preset picture format file corresponding to the page of the portable document format file by using the trained intelligent file analysis model, before determining the type area of the preset picture format file, further includes:
constructing a sample data set;
performing layout marking on the sample data in the sample data set by using a deep learning image marking tool to obtain a model tuning training set;
and training the basic file intelligent analysis model by using the model tuning training set to obtain a trained file intelligent analysis model.
Optionally, the constructing a sample dataset includes:
acquiring a training file;
performing page-to-picture processing on the training file to obtain a training picture;
carrying out gray level conversion processing, image smoothing processing, edge detection processing and binarization preprocessing on the training picture to obtain sample data;
a sample data set is constructed using the sample data.
Optionally, the training the basic intelligent file analysis model by using the model tuning training set to obtain a trained intelligent file analysis model includes:
and training the basic file intelligent analysis model by using the model tuning training set and the intelligent document multi-mode pre-training model through the Baidu PaddlePaddle deep learning framework to obtain a trained file intelligent analysis model.
Optionally, the performing layout analysis on the preset picture format file corresponding to the page of the portable document format file by using the trained intelligent file analysis model, and determining the type area of the preset picture format file includes:
and carrying out layout analysis on a preset picture format file corresponding to a page of the portable document format file by using a trained file intelligent analysis model, and determining a header area, a footer area, a title area, an author unit area, a chapter area, a paragraph area, a picture text area, a table text area, a formula area and a reference document area of the preset picture format file.
In a second aspect, the present application provides a structured document parsing apparatus for a portable document format file, including:
the parsing module is used for parsing the portable document format file and extracting metadata information, content information and page size information corresponding to the portable document format file; the content information includes: text, picture, text coordinates, and picture coordinates;
the analysis module is used for carrying out layout analysis on a preset picture format file corresponding to a page of the portable document format file by using the trained intelligent file analysis model, and determining a type area of the preset picture format file;
the matching module is used for matching the text, the picture and the type area by utilizing the trained file intelligent analysis model based on the page size information, the text coordinates and the picture coordinates to obtain first structured data;
the mapping module is used for carrying out association mapping on the reference document and the reference sentence by utilizing a regular expression and the text coordinates based on the first structured data to obtain second structured data;
and the output module is used for correlating and outputting the metadata information and the second structured data to realize the structured analysis of the portable document format file.
In a third aspect, the present application provides a structured parsing apparatus for a portable document format file, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the method for structured parsing of a portable document format file as described in any one of the above when executing the computer program.
In a fourth aspect, the present application provides a readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a method of structured parsing of a portable document format file as described in any of the preceding claims.
From the above technical solution, compared with the prior art, the present application has the following advantages:
The present application first parses the portable document format file and extracts the metadata information, content information and page size information corresponding to the file, where the content information includes text, pictures, text coordinates, and picture coordinates. Next, layout analysis is performed on the preset picture format file corresponding to each page of the portable document format file by using the trained file intelligent analysis model, and the type areas of the preset picture format file are determined. Based on the page size information, text coordinates and picture coordinates, the texts and pictures are then matched with the type areas by using the trained file intelligent analysis model to obtain first structured data. Finally, based on the first structured data, association mapping between the references and the citation sentences is performed by using regular expressions and the text coordinates to obtain second structured data, and the metadata information and the second structured data are associated and output, which completes the structured parsing of the portable document format file. Because the texts and pictures are matched and associated with the type areas on the basis of the page size information, text coordinates and picture coordinates, parsing accuracy is improved.
Drawings
FIG. 1 is a flow chart of a method for structured parsing of a portable document format file provided herein;
FIG. 2 is a schematic structural diagram of a structured parsing device for a portable document format file provided in the present application.
Detailed Description
As described above, existing structured parsing methods suffer from insufficient parsing accuracy. Specifically, such a method generally locates the areas occupied by the header, footer, title, paragraphs, pictures and the like of the portable document format file by analyzing and recognizing the image layout of the file, and then obtains the text at each area position by character recognition. Although current character recognition technology is mature, it cannot guarantee 100% accuracy, which leads to insufficient parsing accuracy.
In order to solve the above problems, the present application provides a method for structured parsing of a portable document format file. First, the portable document format file is parsed, and the metadata information, content information and page size information corresponding to the file are extracted, where the content information includes text, pictures, text coordinates, and picture coordinates. Next, layout analysis is performed on the preset picture format file corresponding to each page of the portable document format file by using the trained file intelligent analysis model, and the type areas of the preset picture format file are determined. Based on the page size information, text coordinates and picture coordinates, the texts and pictures are then matched with the type areas by using the trained file intelligent analysis model to obtain first structured data. Finally, based on the first structured data, association mapping between the references and the citation sentences is performed by using regular expressions and the text coordinates to obtain second structured data, and the metadata information and the second structured data are associated and output, which completes the structured parsing of the portable document format file.
Therefore, based on page size information, text coordinates and picture coordinates, the text and the picture are matched and associated with the type area by using the trained file intelligent analysis model, so that the analysis accuracy is improved.
It should be noted that the method for structured parsing of the portable document format file and the related products provided by the application can be applied to the technical field of data processing. The foregoing is merely an example, and is not intended to limit the application fields of the method for structured parsing of a portable document format file and related products provided in the present application.
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
Fig. 1 is a flowchart of a method for structured parsing of a portable document format file provided in the present application. Referring to fig. 1, a method for structured parsing of a portable document format file provided in the present application may include:
S101: analyzing a portable document format file, and extracting metadata information, content information and page size information corresponding to the portable document format file; the content information includes: text, picture, text coordinates, and picture coordinates.
In practical application, the device for structured parsing of a portable document format file first obtains a portable document format (Portable Document Format, PDF) file that needs to be structured parsed. And then analyzing the portable document format file, namely analyzing the PDF file to obtain metadata information of the PDF file. In addition, the structural analysis device of the portable document format file also needs to analyze the PDF file stream data to obtain the content information and page size information of the PDF file. The content information comprises text, pictures, text coordinates and picture coordinates, and the page size information refers to the width and the height of a PDF file page.
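The output of S101 can be pictured as a small data model. The following is a minimal sketch only; the class and field names (`TextSpan`, `PictureItem`, `PageContent`, `ParsedPdf`) are hypothetical and do not appear in the application, and the coordinate values are made-up illustration data.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Bounding box as (x0, y0, x1, y1); the unit (PDF points) is an assumption.
BBox = Tuple[float, float, float, float]


@dataclass
class TextSpan:
    """A piece of text together with its text coordinates."""
    text: str
    bbox: BBox


@dataclass
class PictureItem:
    """A picture together with its picture coordinates."""
    name: str
    bbox: BBox


@dataclass
class PageContent:
    """Page size information plus the content information of one page."""
    width: float
    height: float
    texts: List[TextSpan] = field(default_factory=list)
    pictures: List[PictureItem] = field(default_factory=list)


@dataclass
class ParsedPdf:
    """Metadata information plus per-page content for one PDF file."""
    metadata: Dict[str, str]
    pages: List[PageContent] = field(default_factory=list)


# Illustrative values only; a real parser would fill these from the file.
page = PageContent(width=595.0, height=842.0)
page.texts.append(TextSpan("Introduction", (72.0, 90.0, 200.0, 104.0)))
page.pictures.append(PictureItem("fig1", (72.0, 300.0, 300.0, 450.0)))
parsed = ParsedPdf(metadata={"title": "Example", "author": "Unknown"}, pages=[page])
```

Keeping the page width and height next to the coordinates makes it possible to rescale them later, when they have to be compared with areas detected on the page image.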
In addition, since the manners of parsing the portable document format file are not the same, the present application can be described in terms of one possible parsing manner.
In one case, it is directed to how to parse the portable document format file. Correspondingly, the parsing the portable document format file, extracting metadata information, content information and page size information corresponding to the portable document format file, includes:
analyzing header file information of a portable document format file, and extracting metadata information corresponding to the portable document format file;
and analyzing the portable document format file based on an open source library, and extracting content information and page size information corresponding to the portable document format file.
In practical application, the structured parsing device of the portable document format file can parse the header information of the PDF file to obtain the corresponding metadata information. The metadata of a PDF file mainly includes: file type, file size, title, subject, keywords, author, creation date, last modification date, the name of the program that originally created the file, the name of the program that converted the file into PDF, and the like. In addition, the device also needs to parse the core content of the PDF file to obtain the corresponding content information and page size information. It should be noted that the device parses the core content of the PDF file through the corresponding interfaces of an open-source lightweight PDF parsing library.
S102: performing layout analysis on the preset picture format file corresponding to the page of the portable document format file by using the trained file intelligent analysis model, and determining the type area of the preset picture format file.
In practical application, the structured parsing device of the portable document format file can use the trained file intelligent analysis model to perform layout analysis on the preset picture format file corresponding to the page of the PDF file, and determine the type area of the preset picture format file. Note that the preset picture format file is a PNG (Portable Network Graphics) image file.
In addition, since the preprocessing modes before layout analysis are different, the present application can be described with respect to one possible preprocessing mode.
In one case, it is directed to how to preprocess a portable document format file. Correspondingly, the performing layout analysis on the preset picture format file corresponding to the page of the portable document format file by using the trained intelligent file analysis model, before determining the type area of the preset picture format file, further comprises:
and converting the page of the portable document format file into a preset picture format file.
In practical application, since the intelligent file analysis model can only perform layout analysis on a picture, each page of the PDF file needs to be converted into a file with a preset picture format, that is, into a PNG image file.
In addition, since the manner of determining the type area of the preset picture format file is not the same, the present application can be described with respect to one possible determination manner.
In one case, it is directed to how to determine the type area of the preset picture format file. Correspondingly, S102: performing layout analysis on a preset picture format file corresponding to a page of the portable document format file by using a trained file intelligent analysis model, and determining a type area of the preset picture format file specifically may include:
and carrying out layout analysis on a preset picture format file corresponding to a page of the portable document format file by using a trained file intelligent analysis model, and determining a header area, a footer area, a title area, an author unit area, a chapter area, a paragraph area, a picture text area, a table text area, a formula area and a reference document area of the preset picture format file.
In practical application, the application performs layout analysis on the preset picture format file corresponding to the page of the portable document format file by using the trained file intelligent analysis model, so that ten types of areas can be determined: a header area, a footer area, a title area, an author unit area, a chapter area, a paragraph area, a picture text area, a table text area, a formula area and a reference document area.
S103: matching the text and the picture with the type area by using the trained file intelligent analysis model based on the page size information, the text coordinates and the picture coordinates to obtain first structured data.
In practical application, the page size information, text coordinates and picture coordinates obtained by the analysis are used for matching and associating with layout area coordinates processed by the file intelligent analysis model. Specifically, the device for structuring and parsing the portable document format file needs to correspond the text coordinates and the picture coordinates obtained by parsing to the coordinates of the type areas, so that the text and the picture are correspondingly filled in different type areas. For example, if the structured parsing device of the portable document format file determines that the header area and the picture area exist in the current PNG image file by using the trained file intelligent analysis model, wherein the coordinates of the header area are (3, 3), the coordinates of the picture area are (3, 2), the coordinates of the text a are (3, 3), and the coordinates of the picture B are (3, 2), the structured parsing device of the portable document format file fills the text a into the header area and fills the picture B into the picture area by using the trained file intelligent analysis model, thereby realizing the matching of the text and the picture with the type area and obtaining the first structured data. It should be noted that the resolution of the page of the PDF file and the PNG image file needs to be unified based on the page size information before matching association is performed.
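The coordinate unification and matching described above can be sketched as follows. This is an illustration under stated assumptions, not the application's method: the application performs the matching with the trained model, whereas this sketch substitutes a simple rectangle-overlap rule. The function names, the `min_ratio` threshold, and the assumption that both coordinate systems have their origin at the top-left corner are all hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), top-left origin assumed


def scale_box(box: Box, page_size: Tuple[float, float],
              image_size: Tuple[float, float]) -> Box:
    """Convert a box from PDF page units to image pixels using the page size."""
    sx = image_size[0] / page_size[0]
    sy = image_size[1] / page_size[1]
    x0, y0, x1, y1 = box
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)


def overlap_ratio(inner: Box, outer: Box) -> float:
    """Fraction of the area of `inner` that lies inside `outer`."""
    width = max(0.0, min(inner[2], outer[2]) - max(inner[0], outer[0]))
    height = max(0.0, min(inner[3], outer[3]) - max(inner[1], outer[1]))
    area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    return (width * height) / area if area > 0 else 0.0


def match_items(items: List[Tuple[str, Box]], regions: List[Tuple[str, Box]],
                page_size: Tuple[float, float], image_size: Tuple[float, float],
                min_ratio: float = 0.5) -> Dict[str, List[str]]:
    """Fill each item into the type area that covers most of it."""
    result: Dict[str, List[str]] = {label: [] for label, _ in regions}
    for name, box in items:
        scaled = scale_box(box, page_size, image_size)
        best_label, best_ratio = None, min_ratio
        for label, region_box in regions:
            ratio = overlap_ratio(scaled, region_box)
            if ratio >= best_ratio:
                best_label, best_ratio = label, ratio
        if best_label is not None:
            result[best_label].append(name)
    return result


# Made-up example: a 600 x 800 page rendered as a 1200 x 1600 pixel image.
regions = [("title", (100.0, 100.0, 1100.0, 200.0)),
           ("picture", (100.0, 600.0, 700.0, 1000.0))]
items = [("text A", (60.0, 60.0, 500.0, 90.0)),
         ("picture B", (60.0, 310.0, 340.0, 490.0))]
matched = match_items(items, regions, (600.0, 800.0), (1200.0, 1600.0))
```

Scaling first and comparing second mirrors the note above that the resolutions of the PDF page and the PNG image must be unified before matching; an item that overlaps no area sufficiently is simply left unassigned in this sketch.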
In addition, since the ways of training the file intelligent analysis model are different, the present application can be described with respect to one possible training way.
In one case, it is directed to how to train the file intelligent analysis model. Correspondingly, the performing layout analysis on the preset picture format file corresponding to the page of the portable document format file by using the trained intelligent file analysis model, before determining the type area of the preset picture format file, further comprises:
constructing a sample data set;
performing layout marking on the sample data in the sample data set by using a deep learning image marking tool to obtain a model tuning training set;
and training the basic file intelligent analysis model by using the model tuning training set to obtain a trained file intelligent analysis model.
In practical application, the structured parsing device of the portable document format file constructs a sample data set containing a plurality of sample data, and annotates the layout of the sample data with labelme (a deep learning image annotation tool) to obtain a small labeled sample data set, namely the model tuning training set. The label types include header, footer, title, author unit, chapter, paragraph, picture text, table text, formula, and reference. The basic PP-StructureV2 (file intelligent analysis) model is then trained on this labeled data set with PaddlePaddle (Baidu's deep learning framework) to obtain the trained file intelligent analysis model.
In addition, since the manner in which the sample data sets are constructed is not the same, the present application may be described in terms of one possible manner of construction.
In one case, it is directed to how to construct the sample dataset. Accordingly, the constructing a sample dataset includes:
acquiring a training file;
performing page-to-picture processing on the training file to obtain a training picture;
carrying out gray level conversion processing, image smoothing processing, edge detection processing and binarization preprocessing on the training picture to obtain sample data;
a sample data set is constructed using the sample data.
In practical application, the structured parsing device of the portable document format file randomly selects 5000 sample files from a PDF electronic file database as training files, of which 4000 serve as training samples and 1000 serve as verification samples. Because the file intelligent analysis model performs layout analysis on page images, the 5000 sample files are converted into pictures to serve as training pictures, 25000 pictures in total, of which 20000 form the training set and 5000 form the verification set. Finally, gray level conversion, image smoothing, edge detection and binarization preprocessing are performed on the training pictures to obtain sample data, and the sample data set is constructed from the sample data.
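The four preprocessing steps can be sketched in pure Python on a tiny image held as nested lists. The application does not name specific algorithms, so the 3 x 3 mean filter, the Sobel operator, the fixed threshold of 128 and all function names below are assumptions chosen for illustration; a real pipeline would normally use an image library.

```python
from typing import List, Tuple

Gray = List[List[int]]


def to_gray(rgb: List[List[Tuple[int, int, int]]]) -> Gray:
    """Gray level conversion with the common 0.299/0.587/0.114 weights."""
    return [[(299 * r + 587 * g + 114 * b) // 1000 for (r, g, b) in row]
            for row in rgb]


def neighborhood(img: Gray, y: int, x: int) -> Gray:
    """3 x 3 window around (y, x); border pixels are repeated (clamped)."""
    h, w = len(img), len(img[0])
    return [[img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
             for dx in (-1, 0, 1)] for dy in (-1, 0, 1)]


def smooth(img: Gray) -> Gray:
    """Image smoothing with a 3 x 3 mean filter (assumed choice)."""
    return [[sum(sum(r) for r in neighborhood(img, y, x)) // 9
             for x in range(len(img[0]))] for y in range(len(img))]


def sobel(img: Gray) -> Gray:
    """Edge detection with the Sobel operator (assumed choice)."""
    out = []
    for y in range(len(img)):
        row = []
        for x in range(len(img[0])):
            n = neighborhood(img, y, x)
            gx = (n[0][2] + 2 * n[1][2] + n[2][2]) - (n[0][0] + 2 * n[1][0] + n[2][0])
            gy = (n[2][0] + 2 * n[2][1] + n[2][2]) - (n[0][0] + 2 * n[0][1] + n[0][2])
            row.append(min(255, abs(gx) + abs(gy)))
        out.append(row)
    return out


def binarize(img: Gray, threshold: int = 128) -> Gray:
    """Binarization with a fixed threshold (assumed value)."""
    return [[255 if v >= threshold else 0 for v in row] for row in img]


def preprocess(rgb: List[List[Tuple[int, int, int]]]) -> Gray:
    """Apply the four steps in the order listed in the text."""
    return binarize(sobel(smooth(to_gray(rgb))))


# Made-up 6 x 6 image: left half black, right half white.
image = [[(0, 0, 0)] * 3 + [(255, 255, 255)] * 3 for _ in range(6)]
edges = preprocess(image)
```

On this test image the result marks the band around the black-to-white transition and leaves the flat outer columns at zero.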
In addition, since the ways of training the file intelligent analysis model are different, the present application can be described in terms of one possible training way.
In one case, it is directed to how to train the file intelligent analysis model. Correspondingly, the training of the basic file intelligent analysis model by using the model tuning training set to obtain a trained file intelligent analysis model comprises the following steps:
and training the basic file intelligent analysis model by using the model tuning training set and the intelligent document multi-mode pre-training model through the Baidu PaddlePaddle deep learning framework to obtain a trained file intelligent analysis model.
In practical application, the basic PP-StructureV2 (file intelligent analysis) model is trained with the model tuning training set, based on LayoutLMv3 (an intelligent document multi-mode pre-training model) from Microsoft Research Asia, through PaddlePaddle (Baidu's deep learning framework), so that a trained file intelligent analysis model is obtained. PP-StructureV2 is a file intelligent analysis model developed in-house by the Baidu team, which aims to help developers complete document understanding tasks such as layout analysis and table recognition. When PP-StructureV2 is applied, the image correction module first determines the orientation of the whole image and corrects it, after which the layout analysis and key information extraction tasks can be carried out. In the layout analysis task, the image first passes through a layout analysis model, which divides it into areas of different types such as text, tables and images, and these areas are then recognized separately. For example, a table area is sent to the table recognition module for structural recognition, a text area is sent to an OCR (Optical Character Recognition) engine for text recognition, and finally the layout restoration module restores the results to a PDF file consistent with the original image layout. In the key information extraction task, the OCR engine first extracts the text content, the semantic entity recognition module then obtains the semantic entities in the image, and finally the relation extraction module obtains the correspondence between the semantic entities, so that the required key information is extracted.
In addition, LayoutLMv3 is an intelligent document multi-mode pre-training model pre-trained and released by Microsoft Research Asia. It draws on masked language modeling, a self-supervised pre-training method from natural language processing research, which learns representations with contextual semantics by randomly masking a certain proportion of the words in a text and reconstructing the masked words from their context.
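The random masking step of masked language modeling can be illustrated with a few lines of Python. This sketch only prepares the masked input and the reconstruction targets; it does not train a model and is not LayoutLMv3 code. The function name, the 15% ratio, the `[MASK]` token and the fixed seed are assumptions for illustration.

```python
import random
from typing import Dict, List, Tuple


def mask_tokens(tokens: List[str], ratio: float = 0.15,
                mask_token: str = "[MASK]",
                seed: int = 0) -> Tuple[List[str], Dict[int, str]]:
    """Replace a random share of tokens with a mask and remember the originals."""
    if not tokens:
        return [], {}
    rng = random.Random(seed)  # fixed seed keeps the example repeatable
    count = max(1, round(len(tokens) * ratio))
    positions = sorted(rng.sample(range(len(tokens)), count))
    masked = list(tokens)
    targets: Dict[int, str] = {}
    for position in positions:
        targets[position] = masked[position]  # word the model must reconstruct
        masked[position] = mask_token
    return masked, targets


tokens = [f"w{i}" for i in range(20)]
masked, targets = mask_tokens(tokens)
```

During pre-training, a model would receive `masked` as input and be scored on how well it predicts the words stored in `targets` from the surrounding context.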
S104: based on the first structured data, performing association mapping between the references and the citation sentences by using regular expressions and the text coordinates to obtain second structured data.
In practical application, after filling the texts and pictures into the corresponding type areas, the structured parsing device of the portable document format file goes on to perform association mapping between the references and the citation sentences. Specifically, based on the first structured data, the mapping is performed using regular expressions and the text coordinates. Citation sentences usually point to references through numbers in parentheses, numbers in square brackets, or numeric superscripts, for example (1), (1, 2, 5), (1-3, 5), [1], [1,2,5] and [1-3,5]. For citation sentences of the bracketed-number type, the reference indexes are obtained through regular expressions and mapped to the references to obtain the second structured data; for citation sentences of the numeric-superscript type, the superscript is identified from the text coordinates and the preceding and following characters, so that the reference index is obtained and mapped to the reference to obtain the second structured data. In addition, journal information and year information of the PDF file can be obtained by applying regular expressions to the identified header area and footer area.
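For the bracketed-number citation type, the regular-expression step might look like the following sketch. The pattern, the function names and the sample reference list are assumptions for illustration; the application does not disclose its actual expressions, and superscript citations (which need the text coordinates) are not covered here.

```python
import re
from typing import Dict, List

# Matches (1), (1, 2, 5), (1-3, 5), [1], [1,2,5], [1-3,5] and similar forms.
CITATION = re.compile(r"[\[(]\s*(\d+(?:\s*[-,]\s*\d+)*)\s*[\])]")


def expand_indices(body: str) -> List[int]:
    """Turn '1-3, 5' into [1, 2, 3, 5]."""
    indices: List[int] = []
    for part in body.split(","):
        bounds = [int(p) for p in part.split("-")]
        indices.extend(range(bounds[0], bounds[-1] + 1))
    return indices


def map_citations(sentence: str, references: Dict[int, str]) -> Dict[int, str]:
    """Map every reference index cited in the sentence to its reference entry."""
    found: List[int] = []
    for match in CITATION.finditer(sentence):
        for index in expand_indices(match.group(1)):
            if index in references and index not in found:
                found.append(index)
    return {index: references[index] for index in found}


# Made-up reference list and citation sentence.
references = {i: f"Reference {i}" for i in range(1, 6)}
mapping = map_citations("Prior work [1-3,5] and (2) agree.", references)
```

A pattern this simple also matches a bare number in parentheses such as a year, so a practical version would need extra checks, for example discarding indexes that exceed the length of the reference list, as the `index in references` test does here.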
S105: Associate and output the metadata information and the second structured data, thereby achieving structured parsing of the portable document format file.
In practical application, the structured parsing device of the portable document format file matches the text and pictures with the type areas, then performs association mapping between the reference documents and the citation sentences, and finally associates the resulting second structured data with the metadata information, which serves as auxiliary information, and outputs them together, thereby achieving structured parsing and structured output of the portable document format file. It should be noted that the structured parsing method provided by the present application can be applied to dual-layer PDF files.
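As a rough sketch of step S105, the metadata and the second structured data could be associated in a single record and serialized. The output format, the field names `metadata` and `content`, and the function `build_output` are assumptions for illustration; the application does not specify them.

```python
import json
from typing import Any, Dict


def build_output(metadata: Dict[str, Any], second_structured: Dict[str, Any]) -> str:
    """Associate metadata with the structured content and serialize the pair as JSON."""
    record = {
        "metadata": metadata,          # auxiliary information from the PDF header
        "content": second_structured,  # type areas plus citation-to-reference links
    }
    # ensure_ascii=False keeps non-ASCII titles and author names readable.
    return json.dumps(record, ensure_ascii=False, indent=2)
```

For example, `build_output({"title": "Example"}, {"citations": {"s1": [1, 2]}})` returns one JSON document in which both parts travel together.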
In summary, the present application first parses the portable document format file and extracts the metadata information, content information and page size information corresponding to it, where the content information includes text, pictures, text coordinates and picture coordinates. Layout analysis is then performed on the preset picture format file corresponding to each page of the portable document format file by using the trained file intelligent analysis model to determine the type areas of the preset picture format file, and, based on the page size information, the text coordinates and the picture coordinates, the text and pictures are matched with the type areas by using the trained file intelligent analysis model to obtain the first structured data. Finally, based on the first structured data, the reference documents and citation sentences are associated and mapped by using regular expressions and the text coordinates to obtain the second structured data, and the metadata information and the second structured data are associated and output, which achieves structured parsing of the portable document format file. Because the text and pictures are matched with the type areas on the basis of the page size information, text coordinates and picture coordinates, the parsing accuracy is improved.
Based on the method for structured analysis of the portable document format file provided by the embodiment, the application also provides a device for structured analysis of the portable document format file. The structure parsing apparatus of the portable document format file will be described below with reference to the embodiments and drawings, respectively.
Fig. 2 is a schematic structural diagram of a device for structuring and parsing a portable document format file provided in the present application. Referring to fig. 2, a device 200 for structured parsing of a portable document format file according to an embodiment of the present application includes:
the parsing module 201 is configured to parse the portable document format file, and extract metadata information, content information and page size information corresponding to the portable document format file; the content information includes: text, picture, text coordinates, and picture coordinates;
the analysis module 202 is configured to perform layout analysis on a preset picture format file corresponding to a page of the portable document format file by using a trained file intelligent analysis model, and determine a type area of the preset picture format file;
the matching module 203 is configured to match the text and the picture with the type region by using the trained file intelligent analysis model based on the page size information, the text coordinates and the picture coordinates, so as to obtain first structured data;
the mapping module 204 is configured to perform association mapping between the reference documents and the citation sentences by using a regular expression and the text coordinates based on the first structured data, so as to obtain second structured data;
and the output module 205 is configured to correlate and output the metadata information and the second structured data, so as to implement structural analysis on the portable document format file.
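The role of the page size information in the matching module can be illustrated geometrically. The application attributes the matching to the trained file intelligent analysis model, so the sketch below is only a simplified stand-in under assumptions: both coordinate systems share a top-left origin, boxes are `(x0, y0, x1, y1)` tuples, and the 0.5 overlap threshold and all function names are hypothetical.

```python
from typing import Dict, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)
Size = Tuple[float, float]               # (width, height)


def scale_box(box: Box, page_size: Size, image_size: Size) -> Box:
    """Convert a box from PDF page units to pixel units of the rendered page image."""
    sx = image_size[0] / page_size[0]
    sy = image_size[1] / page_size[1]
    x0, y0, x1, y1 = box
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)


def overlap_ratio(inner: Box, outer: Box) -> float:
    """Fraction of inner's area that lies inside outer."""
    width = min(inner[2], outer[2]) - max(inner[0], outer[0])
    height = min(inner[3], outer[3]) - max(inner[1], outer[1])
    area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    if width <= 0 or height <= 0 or area <= 0:
        return 0.0
    return (width * height) / area


def assign_region(
    text_box: Box,
    regions: Dict[str, Box],
    page_size: Size,
    image_size: Size,
    min_ratio: float = 0.5,
) -> Optional[str]:
    """Return the name of the type area that best contains the text box, or None."""
    scaled = scale_box(text_box, page_size, image_size)
    best_name, best_ratio = None, min_ratio
    for name, region_box in regions.items():
        ratio = overlap_ratio(scaled, region_box)
        if ratio >= best_ratio:
            best_name, best_ratio = name, ratio
    return best_name
```

The scaling step is the reason page size is extracted at all: text coordinates come from the PDF, while the type areas are detected on the rendered picture, so the two must be brought into one coordinate system before they can be compared.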
As an embodiment, the parsing module 201 is specifically configured to parse a portable document format file:
analyzing header file information of a portable document format file, and extracting metadata information corresponding to the portable document format file;
analyzing the portable document format file based on an open source library, and extracting content information and page size information corresponding to the portable document format file;
the content information includes: text, picture, text coordinates, and picture coordinates.
As an embodiment, the above-mentioned device 200 for analyzing the structure of the portable document format file further includes: a conversion module;
and the conversion module is used for converting the page of the portable document format file into a preset picture format file.
As an embodiment, the above-mentioned structured analysis device 200 for a portable document format file further includes: the system comprises a construction module, a labeling module and a training module;
a construction module for constructing a sample dataset;
the marking module is used for marking the layout of the sample data in the sample data set by using a deep learning image marking tool to obtain a model tuning training set;
and the training module is used for training the basic file intelligent analysis model by using the model tuning training set to obtain a trained file intelligent analysis model.
As an embodiment, the above building module is specifically configured to:
acquiring a training file;
performing page-to-picture processing on the training file to obtain a training picture;
carrying out gray level conversion processing, image smoothing processing, edge detection processing and binarization preprocessing on the training picture to obtain sample data;
a sample data set is constructed using the sample data.
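The four preprocessing operations named above can be sketched in pure Python on a small image held as nested lists. This is an illustrative assumption about how the steps might be chained: the 3x3 mean filter, the Sobel operator, the luma weights and the threshold of 128 are choices made for the example and are not specified by the application.

```python
from typing import List, Tuple

Gray = List[List[float]]
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def to_gray(rgb: List[List[Tuple[int, int, int]]]) -> Gray:
    """Gray level conversion using ITU-R BT.601 luma weights."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in rgb]


def _window(img: Gray, y: int, x: int) -> List[List[float]]:
    """3x3 neighborhood around (y, x); border pixels are repeated."""
    h, w = len(img), len(img[0])
    return [
        [img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)] for dx in (-1, 0, 1)]
        for dy in (-1, 0, 1)
    ]


def smooth(img: Gray) -> Gray:
    """Image smoothing with a 3x3 mean filter."""
    return [
        [sum(sum(row) for row in _window(img, y, x)) / 9.0 for x in range(len(img[0]))]
        for y in range(len(img))
    ]


def edges(img: Gray) -> Gray:
    """Edge detection: Sobel gradient magnitude."""
    out = []
    for y in range(len(img)):
        row = []
        for x in range(len(img[0])):
            win = _window(img, y, x)
            gx = sum(SOBEL_X[i][j] * win[i][j] for i in range(3) for j in range(3))
            gy = sum(SOBEL_Y[i][j] * win[i][j] for i in range(3) for j in range(3))
            row.append((gx * gx + gy * gy) ** 0.5)
        out.append(row)
    return out


def binarize(img: Gray, threshold: float) -> List[List[int]]:
    """Binarization: 1 where the value reaches the threshold, else 0."""
    return [[1 if value >= threshold else 0 for value in row] for row in img]


def preprocess(rgb: List[List[Tuple[int, int, int]]], threshold: float = 128.0) -> List[List[int]]:
    """Apply the four steps in the order they are listed in the text."""
    return binarize(edges(smooth(to_gray(rgb))), threshold)
```

In practice an image library would replace these loops; the pure-Python form is used here only so that each step is visible.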
As an embodiment, the training module is specifically configured to:
train the basic file intelligent analysis model by using the model tuning training set and the intelligent document multimodal pre-training model through the Baidu PaddlePaddle deep learning framework to obtain a trained file intelligent analysis model.
As an embodiment, the analysis module 202 is specifically configured to:
perform layout analysis on the preset picture format file corresponding to a page of the portable document format file by using the trained file intelligent analysis model, and determine a header area, a footer area, a title area, an author affiliation area, a chapter area, a paragraph area, a picture text area, a table text area, a formula area and a reference document area of the preset picture format file.
In summary, the present application first parses the portable document format file and extracts the metadata information, content information and page size information corresponding to it, where the content information includes text, pictures, text coordinates and picture coordinates. Layout analysis is then performed on the preset picture format file corresponding to each page of the portable document format file by using the trained file intelligent analysis model to determine the type areas of the preset picture format file, and, based on the page size information, the text coordinates and the picture coordinates, the text and pictures are matched with the type areas by using the trained file intelligent analysis model to obtain the first structured data. Finally, based on the first structured data, the reference documents and citation sentences are associated and mapped by using regular expressions and the text coordinates to obtain the second structured data, and the metadata information and the second structured data are associated and output, which achieves structured parsing of the portable document format file. Because the text and pictures are matched with the type areas on the basis of the page size information, text coordinates and picture coordinates, the parsing accuracy is improved.
In addition, the application also provides a structured analysis device for the portable document format file, which comprises the following components: a memory for storing a computer program; a processor for implementing the steps of the method for structured parsing of a portable document format file as described in any one of the above when executing the computer program.
In addition, the application further provides a readable storage medium, wherein the readable storage medium stores a computer program, and the computer program realizes the steps of the method for analyzing the structure of the portable document format file according to any one of the above steps when being executed by a processor.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for structured parsing of a portable document format file, the method comprising:
analyzing a portable document format file, and extracting metadata information, content information and page size information corresponding to the portable document format file; the content information includes: text, picture, text coordinates, and picture coordinates;
performing layout analysis on a preset picture format file corresponding to a page of the portable document format file by using a trained file intelligent analysis model, and determining a type area of the preset picture format file;
based on the page size information, the text coordinates and the picture coordinates, matching the text and the picture with the type region by using the trained file intelligent analysis model to obtain first structured data;
based on the first structured data, performing association mapping between the reference documents and the citation sentences by using a regular expression and the text coordinates to obtain second structured data;
and correlating and outputting the metadata information and the second structured data to realize the structured analysis of the portable document format file.
2. The method of claim 1, wherein parsing the portable document format file to extract metadata information, content information, and page size information corresponding to the portable document format file comprises:
analyzing header file information of a portable document format file, and extracting metadata information corresponding to the portable document format file;
and analyzing the portable document format file based on an open source library, and extracting content information and page size information corresponding to the portable document format file.
3. The method according to claim 1, wherein the performing layout analysis on the preset picture format file corresponding to the page of the portable document format file using the trained intelligent file analysis model, before determining the type area of the preset picture format file, further comprises:
and converting the page of the portable document format file into a preset picture format file.
4. The method according to claim 1, wherein the performing layout analysis on the preset picture format file corresponding to the page of the portable document format file using the trained intelligent file analysis model, before determining the type area of the preset picture format file, further comprises:
constructing a sample data set;
performing layout marking on the sample data in the sample data set by using a deep learning image marking tool to obtain a model tuning training set;
and training the basic file intelligent analysis model by using the model tuning training set to obtain a trained file intelligent analysis model.
5. The method of claim 4, wherein constructing the sample dataset comprises:
acquiring a training file;
performing page-to-picture processing on the training file to obtain a training picture;
carrying out gray level conversion processing, image smoothing processing, edge detection processing and binarization preprocessing on the training picture to obtain sample data;
a sample data set is constructed using the sample data.
6. The method of claim 4, wherein training the underlying intelligent document analysis model using the model tuning training set to obtain a trained intelligent document analysis model comprises:
and training the basic file intelligent analysis model by using the model tuning training set and the intelligent document multimodal pre-training model through the Baidu PaddlePaddle deep learning framework to obtain a trained file intelligent analysis model.
7. The method according to claim 1, wherein the performing layout analysis on the preset picture format file corresponding to the page of the portable document format file using the trained intelligent file analysis model, and determining the type area of the preset picture format file, includes:
and carrying out layout analysis on a preset picture format file corresponding to a page of the portable document format file by using a trained file intelligent analysis model, and determining a header area, a footer area, a title area, an author affiliation area, a chapter area, a paragraph area, a picture text area, a table text area, a formula area and a reference document area of the preset picture format file.
8. A structured document parsing apparatus for a portable document format file, comprising:
the parsing module is used for parsing the portable document format file and extracting metadata information, content information and page size information corresponding to the portable document format file; the content information includes: text, picture, text coordinates, and picture coordinates;
the analysis module is used for carrying out layout analysis on a preset picture format file corresponding to a page of the portable document format file by using the trained intelligent file analysis model, and determining a type area of the preset picture format file;
the matching module is used for matching the text, the picture and the type area by utilizing the trained file intelligent analysis model based on the page size information, the text coordinates and the picture coordinates to obtain first structured data;
the mapping module is used for carrying out association mapping between the reference documents and the citation sentences by utilizing a regular expression and the text coordinates based on the first structured data to obtain second structured data;
and the output module is used for correlating and outputting the metadata information and the second structured data to realize the structured analysis of the portable document format file.
9. A structured parsing apparatus for a portable document format file, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the structured parsing method of a portable document format file according to any one of claims 1 to 7 when executing said computer program.
10. A readable storage medium, wherein a computer program is stored on the readable storage medium, which when executed by a processor, implements the steps of the structured parsing method of a portable document format file according to any one of claims 1 to 7.
CN202311498326.7A 2023-11-10 2023-11-10 Structured analysis method of portable document format file and related products Pending CN117473980A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311498326.7A CN117473980A (en) 2023-11-10 2023-11-10 Structured analysis method of portable document format file and related products

Publications (1)

Publication Number Publication Date
CN117473980A true CN117473980A (en) 2024-01-30

Family

ID=89639490

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311498326.7A Pending CN117473980A (en) 2023-11-10 2023-11-10 Structured analysis method of portable document format file and related products

Country Status (1)

Country Link
CN (1) CN117473980A (en)

Citations (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104516919A (en) * 2013-09-30 2015-04-15 北大方正集团有限公司 Quoting annotation processing method and system
CN106326193A (en) * 2015-06-18 2017-01-11 北京大学 Footnote identification method and footnote and footnote citation association method in fixed-layout document
CN106886509A (en) * 2017-03-06 2017-06-23 大连理工大学 A kind of academic dissertation form automatic testing method
CN109271613A (en) * 2018-09-25 2019-01-25 四川译讯信息科技有限公司 A kind of pdf document analytic method
CN110705223A (en) * 2019-08-13 2020-01-17 北京众信博雅科技有限公司 Footnote recognition and extraction method for multi-page layout document
CN111259830A (en) * 2020-01-19 2020-06-09 中国农业科学院农业信息研究所 Method and system for fragmenting PDF document contents in overseas agriculture
CN113807158A (en) * 2020-12-04 2021-12-17 四川医枢科技股份有限公司 PDF content extraction method, device and equipment
CN114330247A (en) * 2021-11-09 2022-04-12 世纪保众(北京)网络科技有限公司 Automatic insurance clause analysis method based on image recognition
CN114359924A (en) * 2021-11-30 2022-04-15 泰康保险集团股份有限公司 Data processing method, device, equipment and storage medium
CN114663904A (en) * 2022-04-02 2022-06-24 成都卫士通信息产业股份有限公司 PDF document layout detection method, device, equipment and medium
CN114782122A (en) * 2022-03-15 2022-07-22 福建亿力电力科技有限责任公司 Automatic analysis method and system for bidder information in bidding material
CN114863408A (en) * 2021-06-10 2022-08-05 四川医枢科技有限责任公司 Document content classification method, system, device and computer readable storage medium
CN115131804A (en) * 2022-04-21 2022-09-30 腾讯科技(深圳)有限公司 Document identification method and device, electronic equipment and computer readable storage medium
CN115223182A (en) * 2022-07-14 2022-10-21 河南中原消费金融股份有限公司 Document layout identification method and related device
CN115455930A (en) * 2022-09-21 2022-12-09 联仁健康医疗大数据科技股份有限公司 Report document processing method and device, electronic equipment and storage medium
CN115512369A (en) * 2022-09-14 2022-12-23 湖南星汉数智科技有限公司 Document image layout analysis method and device, computer equipment and storage medium
CN115578741A (en) * 2022-09-14 2023-01-06 山东科技大学 Mask R-cnn algorithm and type segmentation based scanned file layout analysis method
CN116110051A (en) * 2023-04-13 2023-05-12 合肥机数量子科技有限公司 File information processing method and device, computer equipment and storage medium
CN116306487A (en) * 2023-02-23 2023-06-23 微科智检(佛山市)科技有限公司 Intelligent detection system and method for academic treatises of higher institutions
CN116740723A (en) * 2023-05-16 2023-09-12 西安电子科技大学 PDF document identification method based on open source Paddle framework
CN116824608A (en) * 2023-06-07 2023-09-29 北京工业大学 Answer sheet layout analysis method based on target detection technology
CN116909989A (en) * 2023-07-19 2023-10-20 广州启生信息技术有限公司 AI-based medical library text extraction method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JOHNSON7788: "PP-StructureV2: 一个更强大的文件分析系统", Retrieved from the Internet <URL:https://zhuanlan.zhihu.com/p/577540749?utm_id=0> *
飞桨PADDLEPADDLE: "文档智能分析产业实践,基于PP-StructureV2和OpenVINO实现训练部署开发全流程", Retrieved from the Internet <URL:https://zhuanlan.zhihu.com/p/617295970> *
黄博的机器学习圈子: "炸裂!PDF转Word彻底告别收费时代,这个OCR开源项目要逆天!", pages 1, Retrieved from the Internet <URL:https://cloud.tencent.com/developer/article/2154029> *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination