CN117496545B - PDF document-oriented form data fusion processing method and device - Google Patents

Info

Publication number
CN117496545B
Authority
CN
China
Prior art keywords
fusion
text
initial
target
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410002584.XA
Other languages
Chinese (zh)
Other versions
CN117496545A (en
Inventor
储诚灿
朱海洋
谈旭炜
胡健
应石磊
潘洁
谢文杰
苏轶
祝玲倩
沈萍平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Products Zhongda Digital Technology Co ltd
Original Assignee
Products Zhongda Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Products Zhongda Digital Technology Co ltd filed Critical Products Zhongda Digital Technology Co ltd
Priority to CN202410002584.XA priority Critical patent/CN117496545B/en
Publication of CN117496545A publication Critical patent/CN117496545A/en
Application granted granted Critical
Publication of CN117496545B publication Critical patent/CN117496545B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/412 Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

In the fusion processing method, corresponding fusion parameters are determined for the initial tables contained in each of a plurality of PDF documents. The fusion parameters and the table contents of each initial table are then stored in association in an intermediate table. Finally, the associated table contents are fused based on the fusion parameters of each row in the intermediate table to obtain a fusion result. In this way, the fusion efficiency of multi-source heterogeneous table data can be improved.

Description

PDF document-oriented form data fusion processing method and device
Technical Field
One or more embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a method and an apparatus for fusion processing of form data for PDF documents.
Background
Large supply chain enterprises generally need to regularly acquire large amounts of data from the PDF documents of periodic reports published by listed companies, for industry benchmarking analysis, industry research and the like. This data provides business intelligence for strategic choices such as market opportunities, business layout, commodity optimization and new business exploration, and provides action-path references for tactical plans such as refined management and risk management, thereby helping to improve enterprise management, create value and accelerate the building of world-class enterprises. However, the tables in these PDF documents are numerous, appear at different page positions and vary in length, so it is difficult to quickly fuse the parsed table data by subject. This hinders subsequent fast retrieval and comparative analysis, and restricts the conversion of the data in the PDF documents into resources, assets and value. The traditional approach is to read the documents manually to obtain the data needed for benchmarking analysis and then enter it manually into Excel or Word documents by topic. This usually requires substantial manual effort and tedious repetitive work, and does not support continuous tracking research, benchmarking analysis and ongoing in-depth analysis by enterprises.
Therefore, it is desirable to provide a more efficient method for fusion processing of form data for PDF documents.
Disclosure of Invention
One or more embodiments of the present disclosure describe a method for fusion processing of table data for PDF documents, which may improve fusion efficiency of multi-source heterogeneous table data.
In a first aspect, a method for fusion processing of form data for a PDF document is provided, including:
parsing a plurality of PDF documents to obtain a plurality of initial tables and multi-page text content contained in each PDF document;
converting the multi-page text content corresponding to each PDF document into a plurality of text lists, wherein a single text list comprises a plurality of lines of text;
selecting, from the plurality of text lists corresponding to each PDF document, a target text list corresponding to the page on which each initial table contained in that PDF document is located;
determining fusion parameters of each initial table contained in the plurality of PDF documents; wherein, for any first initial table among the initial tables, determining the corresponding fusion parameters includes: determining a target position of the first initial table in a corresponding first target text list at least by matching the first row of the first initial table against each line of text in the first target text list; extracting key region text of the first initial table from the first target text list at least according to the target position; and determining the fusion parameters of the first initial table based at least on the key region text;
storing the fusion parameters and the data content of each initial table as a data row in an intermediate table;
and fusing the data content in each data row based on the fusion parameters in each data row of the intermediate table to obtain a fusion result.
In a second aspect, a PDF document-oriented form data fusion processing apparatus is provided, including:
a parsing unit, configured to parse a plurality of PDF documents to obtain a plurality of initial tables and multi-page text content contained in each PDF document;
a conversion unit, configured to convert the multi-page text content corresponding to each PDF document into a plurality of text lists, where a single text list includes a plurality of lines of text;
a selecting unit, configured to select, from the plurality of text lists corresponding to each PDF document, a target text list corresponding to the page on which each initial table contained in that PDF document is located;
a determining unit, configured to determine fusion parameters of each initial table contained in the plurality of PDF documents;
the determining unit includes:
a matching sub-module, configured to, for any first initial table among the initial tables, determine a target position of the first initial table in a corresponding first target text list at least by matching the first row of the first initial table against each line of text in the first target text list;
an extraction sub-module, configured to extract key region text of the first initial table from the first target text list at least according to the target position;
a determining sub-module, configured to determine the fusion parameters of the first initial table based at least on the key region text;
a storage unit, configured to store the fusion parameters and the data content of each initial table as a data row in an intermediate table;
and a fusion unit, configured to fuse the data content in each data row based on the fusion parameters in each data row of the intermediate table to obtain a fusion result.
According to the method and device for fusion processing of form data for PDF documents provided by one or more embodiments of this specification, corresponding fusion parameters are determined for the initial tables contained in each of a plurality of PDF documents. The fusion parameters and the table contents of each initial table are then stored in association in an intermediate table. Finally, the associated table contents are fused based on the fusion parameters of each row in the intermediate table to obtain a fusion result. In this way, the fusion efficiency of multi-source heterogeneous table data can be improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present description, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present description, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 illustrates a flowchart of a method for a PDF document-oriented form data fusion process, in accordance with one embodiment;
FIG. 2 illustrates a method schematic diagram of determining a target array in one example;
FIG. 3 illustrates a bi-directional index positioning schematic in one example;
FIG. 4 illustrates a method schematic diagram of extracting information using an entity recognition model in one example;
FIG. 5 shows an intermediate table schematic in one example;
FIG. 6 shows a fusion table schematic in one example;
FIG. 7 shows a schematic diagram of a form data fusion processing device for PDF documents according to an embodiment.
Detailed Description
The following describes the scheme provided in the present specification with reference to the drawings.
Before describing the solutions provided by the embodiments of the present specification, the following description will be made on several concepts mentioned in the present solution.
Multi-source: tables parsed from different PDF documents constitute multiple sources.
Heterogeneous: tables with inconsistent subject content parsed from the same PDF document are heterogeneous.
Fusion processing: classifying and integrating the table data parsed from different PDF documents by subject to form the two-dimensional table structure of a relational database, so that data for the same index can be quickly retrieved, compared and analyzed.
At present, fusion processing is mainly performed on multi-source heterogeneous table data in the following ways:
(1) Manual mode: the table data in the PDF documents to be fused is picked out manually, one item at a time. This is suitable for PDF documents with a small data volume and few sources, but not for large data volumes and many sources. In addition, the manual mode is costly, inefficient and error-prone.
(2) Code mode: the table data is fused using pre-written code that implements complex rules. This method has low accuracy, generality and efficiency.
(3) Artificial intelligence mode: a table classification model is trained in advance, the table data to be fused is classified, and the data is then fused based on the classification result. This method requires a large number of manually prepared and labeled samples, and suffers from high training cost, poor generality, a long cycle and low efficiency.
Therefore, the present scheme uses parameters such as table subject information and position information as identification features, converts and associates the table subject information with the table body data (hereinafter also referred to as table contents), and generates a two-dimensional table structure that can be quickly retrieved and compared, which is fused and stored in a database. Its processing efficiency is far higher than that of the traditional manual mode, and it gives large supply chain enterprises a faster data fusion method for collecting, processing and analyzing data in benchmarking management. The scheme is described in detail below.
FIG. 1 illustrates a flowchart of a method for a PDF document-oriented form data fusion process, in accordance with one embodiment. The method may be performed by any apparatus, device, platform, cluster of devices having computing, processing capabilities. As shown in fig. 1, the method may include the following steps.
Step S102, parsing a plurality of PDF documents to obtain a plurality of initial tables and multi-page text content contained in each PDF document.
The multiple PDF documents may each belong to different enterprises and/or the same enterprise at different times.
In one embodiment, each PDF document may be parsed using a Python-based open source tool (pdfplumber) to obtain the plurality of initial tables contained in the PDF document and the text content of its pages (multi-page text content for short). It should be appreciated that the multi-page text content includes the text content of the page on which each initial table is located.
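The parsing step might be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `collect_pages` and `parse_pdf` and the dictionary keys are hypothetical, and only the pdfplumber calls `pdfplumber.open`, `pdf.pages`, `page.extract_text()` and `page.extract_tables()` are taken from that library.

```python
from typing import Any, Dict, Iterable, List


def collect_pages(pages: Iterable[Any]) -> List[Dict[str, Any]]:
    """Collect text and tables from page objects.

    Each page is expected to offer extract_text() and extract_tables(),
    as pdfplumber's Page objects do.
    """
    collected = []
    for page_number, page in enumerate(pages, start=1):
        collected.append({
            "page_number": page_number,
            # extract_text() may return None for a page without text.
            "text": page.extract_text() or "",
            # Each table is a list of rows; each row is a list of cell values.
            "tables": page.extract_tables(),
        })
    return collected


def parse_pdf(path: str) -> List[Dict[str, Any]]:
    """Open one PDF with pdfplumber and collect its pages."""
    import pdfplumber  # third-party; imported lazily so collect_pages works alone

    with pdfplumber.open(path) as pdf:
        return collect_pages(pdf.pages)
```

Keeping the page number next to each table makes it straightforward to find, in later steps, the text of the page on which a given initial table is located.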
Step S104, converting the multi-page text content corresponding to each PDF document into a plurality of text lists, wherein a single text list comprises a plurality of lines of text.
In one embodiment, for any page of text content, the page of text content may be cut into multiple lines of text according to a line feed, and then the multiple lines of text may be arranged into a list form, so that a corresponding text list may be obtained. In a more specific embodiment, the text list may also indicate an Index (Index), type (Type), size (Size), etc. of each line of text.
In one example, the text list described above may be as shown in table 1.
TABLE 1
Index | Type | Size | Value
0 | str | 1 | 2021 annual report
1 | str | 1 | Cash from financing activities 4,392,059,024.83 -1,389,620,263.94 Not applicable
2 | str | 1 | Net cash flow amount
3 | str | 1 | Explanation of the change in operating revenue: mainly due to the rise in bulk commodity market prices in the current period and the growth of the company's sales scale.
4 | str | 1 | Explanation of the change in operating cost: mainly due to the rise in bulk commodity market prices in the current period and the growth of the company's sales scale.
5 | str | 1 | Explanation of the change in selling expenses: mainly due to the increase in sales scale and related costs in the current period.
In Table 1, the text list includes an Index column, a Type column, a Size column, and a Value column. The content of the index column is a text identifier, which may be numbered starting from 0. The content of the type column is the data type of the text, such as a string (str). The content of the size column is the number of character strings contained in the text. The content of the value column is the text itself (also called the character string).
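The conversion of one page of text into a text list can be sketched as below. The function name `to_text_list` and the dictionary keys are assumptions chosen to mirror the four columns of Table 1; the size is fixed at 1 because each entry holds a single string.

```python
from typing import Dict, List, Union

TextRow = Dict[str, Union[int, str]]


def to_text_list(page_text: str) -> List[TextRow]:
    """Split one page of text on line breaks and index each line from 0."""
    rows: List[TextRow] = []
    for index, line in enumerate(page_text.splitlines()):
        rows.append({
            "index": index,                # position number of the line
            "type": type(line).__name__,   # data type of the text, here "str"
            "size": 1,                     # one string per entry
            "value": line,                 # the line of text itself
        })
    return rows
```

The index assigned here is the position number that the later positioning steps refer to.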
Step S106, selecting, from the plurality of text lists corresponding to each PDF document, a target text list corresponding to the page on which each initial table contained in that PDF document is located.
As described above, the text content for each page in each PDF document is converted to a corresponding text list. Here, a target text list obtained by converting text contents of a page where an initial form is located is extracted.
Step S108, determining fusion parameters of each initial form contained in the plurality of PDF documents.
Since the determination method of the fusion parameters of the initial tables is similar, a description will be given below of the determination method of the corresponding fusion parameters by taking any initial table as an example.
In one embodiment, the fusion parameters described above may include three general categories: attribute, location information, and subject information. The attributes may include a file name of the PDF document, an enterprise name of an enterprise to which the PDF document belongs, a date of index statistics, and the like. The location information may include a page number of a page where the initial table is located, a table order of the initial table in the page where the initial table is located, a target array (including a start line and a stop line) corresponding to the initial table, and the like. The topic information includes key region text (key region for short), form title, measurement unit, and currency unit.
First, regarding the above-mentioned attribute, it is generally possible to extract it directly from the target text list corresponding to the initial form.
In addition, the page number and the table order in the above position information can be obtained by analyzing the page where the initial table is located. The method for acquiring the target array in the above-described position information is described below.
FIG. 2 illustrates a method schematic of determining a target array in one example. In FIG. 2, bidirectional index positioning (described later) is performed between the first row of the initial table and the corresponding target text list, to obtain the position array of the first row in the target text list (hereinafter referred to as the first position array). The first position array contains the position numbers of the lines of text in the target text list that match the items of content in the first row of the initial table. If the position numbers in the first position array form a consecutive sequence, the first position array is directly determined as the target array. Otherwise, bidirectional index positioning is performed between any non-empty row other than the first row of the initial table and the corresponding target text list, to obtain the position array of that row in the target text list (hereinafter referred to as the second position array), and a sub-array is selected from the first position array as the target array based on the second position array.
In one embodiment, selecting a sub-array from the first position array includes: splitting the first position array into a plurality of sub-arrays, where the position numbers in each sub-array form a consecutive sequence; and determining, as the selected sub-array, the sub-array whose mean is smaller than the mean of the second position array (the second mean) and closest to it.
For example, assume that the first position array is [7,8,9,23,24,25]. It can be split into two sub-arrays, [7,8,9] and [23,24,25], whose means are 8 and 24 respectively. If the second position array is [15,16,17], the second mean is 16, and [7,8,9] is selected: 8 and 24 both differ from 16 by 8, but only 8 is smaller than 16. If the second position array is [30,31,32], the second mean is 31, and [23,24,25] is selected: both 8 and 24 are smaller than 31, and 24 (difference 7) is closer to 31 than 8 (difference 23).
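The sub-array selection rule can be sketched as follows. The function names are hypothetical; the logic follows the description above (split into consecutive runs, keep the runs whose mean is below the second mean, take the closest one) and reproduces both worked examples.

```python
from typing import List, Optional, Sequence


def split_consecutive(positions: Sequence[int]) -> List[List[int]]:
    """Split position numbers into runs of consecutive integers."""
    runs: List[List[int]] = []
    for p in sorted(set(positions)):
        if runs and p == runs[-1][-1] + 1:
            runs[-1].append(p)
        else:
            runs.append([p])
    return runs


def mean(values: Sequence[int]) -> float:
    return sum(values) / len(values)


def select_target_array(first_positions: Sequence[int],
                        second_positions: Sequence[int]) -> Optional[List[int]]:
    """Pick the target array from the first position array."""
    runs = split_consecutive(first_positions)
    if not runs:
        return None
    if len(runs) == 1:
        # Already one consecutive sequence: use it directly.
        return runs[0]
    if not second_positions:
        return None
    second_mean = mean(second_positions)
    # Keep only runs that lie before the second row, then take the nearest.
    candidates = [run for run in runs if mean(run) < second_mean]
    if not candidates:
        return None
    return min(candidates, key=lambda run: second_mean - mean(run))


print(select_target_array([7, 8, 9, 23, 24, 25], [15, 16, 17]))  # [7, 8, 9]
print(select_target_array([7, 8, 9, 23, 24, 25], [30, 31, 32]))  # [23, 24, 25]
```

Returning `None` when no run precedes the second row is a choice made for this sketch; the source does not say how that case is handled.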
It should be appreciated that the selected target array includes a number of position numbers such that the smallest position number may be the start row and the largest position number may be the end row.
FIG. 3 illustrates a bi-directional index positioning schematic in one example. In fig. 3, each item of content in the top line of the table may be matched to a line of text of the target text list, so that the position array of the top line corresponding to the target text list may be determined based on the position number (i.e., line number) of the matched text.
It should be noted that the present solution uses an arbitrary further row of the initial table only as a secondary confirmation mechanism for the abnormal case, which improves the efficiency and accuracy of locating the initial table. In practical applications, bidirectional index positioning may also be performed on all rows of the initial table, which is not limited in this solution.
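One possible reading of bidirectional index positioning is sketched below. The source does not define the matching rule, so several points are assumptions: a cell is taken to match a line when the cell text is a substring of that line, a multi-line cell is matched part by part, and "bidirectional" is interpreted as keeping both a cell-to-line index and a line-to-cell index. The names `locate_row` and `position_array` are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def locate_row(row: Sequence[Optional[str]],
               text_list: Sequence[str]) -> Tuple[Dict[int, List[int]], Dict[int, List[int]]]:
    """Build cell -> lines and line -> cells indexes for one table row."""
    cell_to_lines: Dict[int, List[int]] = {}
    line_to_cells: Dict[int, List[int]] = {}
    for cell_idx, cell in enumerate(row):
        # A cell may be None (empty) or span several lines of page text.
        for part in (cell or "").splitlines():
            needle = part.strip()
            if not needle:
                continue
            for line_idx, line in enumerate(text_list):
                if needle in line:
                    lines = cell_to_lines.setdefault(cell_idx, [])
                    if line_idx not in lines:
                        lines.append(line_idx)
                    cells = line_to_cells.setdefault(line_idx, [])
                    if cell_idx not in cells:
                        cells.append(cell_idx)
    return cell_to_lines, line_to_cells


def position_array(row: Sequence[Optional[str]],
                   text_list: Sequence[str]) -> List[int]:
    """Sorted line numbers of the text list matched by any cell of the row."""
    _, line_to_cells = locate_row(row, text_list)
    return sorted(line_to_cells)
```

When the same header text appears on several parts of a page, the resulting position array is non-consecutive, which is the abnormal case that the secondary confirmation with another row is meant to resolve.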
Finally, regarding the above-mentioned subject information, the form title, the unit of measure, and the unit of currency in the subject information may be extracted from the text of the key area, and thus, the method for acquiring the text of the key area will be described first.
Specifically, for each initial table, after determining a corresponding target array for the initial table, a corresponding statistic may be calculated based on each position number therein, and the statistic is taken as a target position of the initial table in the target text list. Then, the key region text of the initial form is extracted from the target text list based at least on the target location.
The statistics may be, for example, maximum, minimum, mode, average, median, or the like.
In one embodiment, the target position is the start line of the initial table in the target text list, so that the step of extracting the key region text may include: extracting, from the target text list, a preset number of lines of text starting from the start line as the key region text of the initial table.
Of course, in practical application, after the statistics are calculated, the target position may be determined by combining a predefined initial position policy set, which is not described herein.
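A minimal sketch of deriving the target position and cutting out the key region is given below. The source does not state in which direction the preset number of lines is counted from the start line, so the `before` and `after` parameters, their defaults, and the function names are all assumptions made for illustration.

```python
import statistics
from typing import Callable, List, Sequence


def target_position(target_array: Sequence[int],
                    statistic: Callable[[Sequence[int]], float] = min) -> int:
    """Reduce the target array to one line number using the chosen statistic."""
    return int(statistic(target_array))


def key_region(text_list: Sequence[str], start_line: int,
               before: int = 3, after: int = 1) -> List[str]:
    """Return the lines in [start_line - before, start_line + after)."""
    lo = max(0, start_line - before)
    hi = min(len(text_list), start_line + after)
    return list(text_list[lo:hi])


lines = ["Report", "Operating income", "Unit: yuan", "Item Amount", "Revenue 10"]
start = target_position([3])                    # minimum position = start line
print(key_region(lines, start, before=2, after=0))
# ['Operating income', 'Unit: yuan']
```

Passing `statistics.median` or `max` as the statistic gives the other variants mentioned in the text.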
It should be noted that the key region text extracted by this scheme may be regarded as a sub-list of the text list: it consists of several lines of text, each presented as a single-line string, without explicit paragraph marks. The key region text may therefore be preprocessed before information such as the table title is extracted from it.
In one example, the preprocessing may include obtaining a start position and an end position of each line of text, and determining whether two adjacent lines belong to the same paragraph by using a heuristic algorithm. For the case that the text cannot be identified as a paragraph, multiple lines of text are combined first, and then are segmented by using common sentence ending characters to divide the text into multiple sentences, so that the text in the key area can be processed into the form of the paragraph or sentence. And then, extracting information such as form titles, measurement units, currency units and the like from the preprocessed key region text by utilizing a predefined rule set or entity recognition model. The table title is used as the identification feature of the fusion and storage of the subsequent multi-source heterogeneous table. The units of measure and the units of currency are used to ensure comparability of the multi-source heterogeneous table data.
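The fallback branch of this preprocessing (merge the lines, then split on sentence-ending characters) might look like the sketch below. The paragraph heuristic based on line start and end positions is not reproduced because the source gives no details. The set of sentence-ending characters and the name `split_sentences` are assumptions; an ASCII period only ends a sentence when followed by whitespace, so that decimal numbers stay intact.

```python
import re
from typing import List, Sequence

# Split after CJK or ASCII sentence enders; "." must be followed by whitespace.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[。！？!?])\s*|(?<=\.)\s+")


def split_sentences(lines: Sequence[str]) -> List[str]:
    """Merge key-region lines and split the result into sentences."""
    merged = " ".join(line.strip() for line in lines if line.strip())
    return [part for part in _SENTENCE_BOUNDARY.split(merged) if part]


print(split_sentences(["Revenue rose. Costs", "also rose! Unit: yuan"]))
# ['Revenue rose.', 'Costs also rose!', 'Unit: yuan']
```

For Chinese source text the lines would more naturally be joined without a space; the space is used here only because the example is in English.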
In one embodiment, the predefined rule set may be, for example: the text located after the "Unit:" symbol and before the "Currency:" symbol is the unit of measure (e.g., "hundred million"); the text located after the "Currency:" symbol is the unit of currency (e.g., "RMB"); and so on.
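The example rule can be expressed as a single regular expression, sketched below. The exact label spellings are assumptions: the pattern accepts the English labels "Unit" and "Currency" as well as "单位" and "币种", which are the likely labels in Chinese reports, with either an ASCII or a full-width colon. The function name `extract_units` is hypothetical.

```python
import re
from typing import Dict, Optional

# Unit of measure: between the unit label and the currency label.
# Unit of currency: the token after the currency label.
_UNIT_RULE = re.compile(
    r"(?:Unit|单位)\s*[:：]\s*(?P<unit>.*?)\s*"
    r"(?:Currency|币种)\s*[:：]\s*(?P<currency>\S+)"
)


def extract_units(text: str) -> Dict[str, Optional[str]]:
    """Apply the rule; both values are None when the pattern is absent."""
    match = _UNIT_RULE.search(text)
    if match is None:
        return {"unit": None, "currency": None}
    return {"unit": match.group("unit"), "currency": match.group("currency")}


print(extract_units("Unit: hundred million Currency: RMB"))
# {'unit': 'hundred million', 'currency': 'RMB'}
```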
In another embodiment, the entity recognition model is trained by using the Bert-BiLSTM-CRF method, and the extraction of information such as form titles by using the model is described below.
FIG. 4 illustrates a method schematic diagram of extracting information using an entity recognition model in one example. In FIG. 4, a text sentence or paragraph obtained by preprocessing the key region text is first segmented into words, and each word is converted into a corresponding BERT (a pre-trained language model) word vector. Next, context information is captured using a BiLSTM (bidirectional long short-term memory network), and the output of the BiLSTM is input as features to a CRF (conditional random field) layer. The CRF layer models the transition probabilities between entity tags with a transition matrix and decodes with the Viterbi algorithm to find the most probable tag sequence, from which the entity information in the sentence or paragraph is obtained. Here, the entity information includes a table title, a unit of measure, a unit of currency, and the like.
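The Viterbi decoding performed by the CRF layer can be illustrated with a small standalone sketch. In a trained model the emission scores would come from the BERT and BiLSTM layers and the transition matrix would be learned; here both are made-up numbers, the tag set (O, B, I) is an assumption, and scores are treated as additive (log-space).

```python
from typing import List, Sequence


def viterbi(emissions: Sequence[Sequence[float]],
            transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring tag sequence.

    emissions[t][k]   score of tag k at position t
    transitions[j][k] score of moving from tag j to tag k
    """
    if not emissions:
        return []
    n_tags = len(transitions)
    score = list(emissions[0])
    backpointers: List[List[int]] = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for cur in range(n_tags):
            # Best previous tag for reaching `cur` at this position.
            best_prev = max(range(n_tags),
                            key=lambda prev: score[prev] + transitions[prev][cur])
            new_score.append(score[best_prev] + transitions[best_prev][cur] + emit[cur])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final tag.
    best = max(range(n_tags), key=lambda tag: score[tag])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


TAGS = ["O", "B", "I"]                      # hypothetical tag set
TRANSITIONS = [[0.0, 0.0, -100.0],          # O -> I is effectively forbidden
               [0.0, -1.0, 0.5],
               [0.0, 0.0, 0.2]]
EMISSIONS = [[1.0, 0.9, 0.0], [0.0, 0.0, 1.0]]
print([TAGS[i] for i in viterbi(EMISSIONS, TRANSITIONS)])  # ['B', 'I']
```

In this example a per-position argmax would choose O followed by I, which the transition matrix penalizes heavily; decoding over the whole sequence yields B followed by I instead, which is the reason a CRF layer is placed on top of the BiLSTM.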
Thus, all the fusion parameters of the initial table are obtained. In practical applications, the fusion parameters may include only one or two types of information in the three types of information, or the content in each type of information may be fewer or more, for example, the attribute may include only a file name and an enterprise name, which are not repeated herein.
Step S110, the fusion parameters and the data content of each initial table are used as each data row and stored in the intermediate table.
This step can also be understood as storing the fusion parameters of the initial table and the data content in association in a database.
It should be appreciated that each row in the intermediate table described above corresponds to an initial table, and each row includes the fusion parameters and table contents of the corresponding initial table.
FIG. 5 shows an intermediate table schematic in one example. In FIG. 5, the contents of the intermediate table are divided into four parts: the first three parts are the fusion parameters (attributes, location information, and subject information), and the last part is the table contents (also referred to as body data or table data). The first row in FIG. 5 gives the names of the fusion parameters and table contents, and the second row gives their data types. The third row holds the values of the fusion parameters and table contents of initial table 1 extracted from document 1, the fourth row holds those of initial table 2 extracted from document 2, and the fifth row holds those of initial table 3 extracted from document 3.
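An intermediate table of this shape could be stored, for instance, in SQLite, as sketched below. The table name, column names and the choice of serializing the table contents as JSON are all assumptions; the source only states that fusion parameters and table contents are stored in association, one row per initial table.

```python
import json
import sqlite3
from typing import Any, Dict, Iterable

# Attributes, location information, subject information, then table contents.
COLUMNS = ["file_name", "enterprise", "stat_date",
           "page_number", "table_order", "start_line", "end_line",
           "key_region", "title", "unit", "currency",
           "content"]


def make_connection() -> sqlite3.Connection:
    """Create an in-memory database holding an empty intermediate table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE intermediate ({})".format(", ".join(COLUMNS)))
    return conn


def store_rows(conn: sqlite3.Connection, rows: Iterable[Dict[str, Any]]) -> None:
    """Insert one data row per initial table."""
    sql = "INSERT INTO intermediate ({}) VALUES ({})".format(
        ", ".join(COLUMNS), ", ".join("?" * len(COLUMNS)))
    params = []
    for row in rows:
        values = [row.get(col) for col in COLUMNS[:-1]]
        # Table contents are serialized so that a whole table fits in one cell.
        values.append(json.dumps(row["content"], ensure_ascii=False))
        params.append(tuple(values))
    conn.executemany(sql, params)
    conn.commit()
```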
Step S112, based on the fusion parameters in each data row in the intermediate table, fusing the data content in each data row to obtain a fusion result.
It should be noted that, in the embodiment of the present disclosure, a single initial table is used to record a certain type of subject index and index value, and the table title in the above fusion parameter may indicate the index category.
Taking fig. 5 as an example, the initial table 1 of the third row is used for recording "income" index, the initial table 2 of the fourth row is used for recording "cost" index, and the initial table 3 of the fifth row is used for recording "sales" index.
In one embodiment, fusing the data content in each data row includes: grouping the data content in each data row according to the index category corresponding to that data row; and filling the indexes and index values contained in the data content of each group into a corresponding fusion table, so that the initial tables are fused according to the groups to which they belong. The plurality of fusion tables thus obtained are determined as the fusion result.
For example, assume that in FIG. 5 the initial table 1 of the third row and the initial table 2 of the fourth row are both used to record the "income" index, and the initial table 3 of the fifth row is used to record a different index (for example, "sales"). The data contents of the initial table 1 and the initial table 2 may then be divided into one group, and the indexes and index values contained in the data contents of that group are filled into one fusion table; the data content of the initial table 3 forms a group of its own, and the indexes and index values it contains are filled into another fusion table.
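Grouping by table title can be sketched as follows. The row layout (a dictionary with `title`, `enterprise`, `date`, `unit`, `currency` and a `content` list of index/value pairs) and the function name `fuse_by_title` are hypothetical; the fused record fields follow the columns named for the fusion table in FIG. 6.

```python
from collections import defaultdict
from typing import Any, Dict, List, Sequence


def fuse_by_title(rows: Sequence[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Build one fusion table per index category (table title)."""
    fused: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for row in rows:
        for index_name, value in row["content"]:
            fused[row["title"]].append({
                "enterprise": row["enterprise"],
                "date": row["date"],
                "index": index_name,
                "value": value,
                "unit": row["unit"],
                "currency": row["currency"],
            })
    return dict(fused)
```

Carrying the unit of measure and currency into every fused record keeps values from different documents comparable after fusion.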
Of course, in practical application, the index and the index value in each initial table may be filled into the fusion table, and the enterprise name, date, measurement unit, currency unit, and the like may be filled into the fusion table, which is not limited in this specification.
FIG. 6 shows a fusion table schematic in one example. In fig. 6, the fusion table may include information such as a business name, date, index name, numerical value (i.e., index value), unit of measure, and unit of currency.
In addition, as described above, the fusion parameters are generally flexibly set, for example, when a table title (corresponding to an abnormal situation) is not included therein, it cannot be classified according to the table title, so that the table fusion can be performed by a clustering method, which will be described in detail below.
In another embodiment, fusing the data content in each data row includes: clustering the data content in each data row by calculating the similarity of the target parameters in the data rows, to obtain a plurality of class clusters; and filling the indexes and index values contained in the data content of each class cluster into a corresponding fusion table, so that the initial tables are fused according to the class clusters to which they belong. The plurality of fusion tables thus obtained are determined as the fusion result.
The target parameter may be, for example, a key region text, a file name, etc. in the fusion parameter. In addition, the target parameters may also include header data or a first column in the table content, and so on.
Of course, in practical application, besides calculating the similarity of the target parameters, the data content in each data row may be clustered by combining the position information in the fusion parameters, which is not limited in this specification.
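The clustering embodiment above can be illustrated with a minimal sketch. The specification does not name a similarity measure or a clustering algorithm, so the following assumes a character-level similarity ratio from Python's difflib and a simple greedy threshold clustering; the row layout, the key name key_region_text, and the threshold value 0.6 are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; an assumed stand-in for the unspecified measure."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_rows(rows, key="key_region_text", threshold=0.6):
    """Greedy clustering: a row joins the first cluster whose representative
    (its first row) is similar enough on the target parameter, else starts a new cluster."""
    clusters = []
    for row in rows:
        for cluster in clusters:
            if similarity(row[key], cluster[0][key]) >= threshold:
                cluster.append(row)
                break
        else:
            # No existing cluster was similar enough.
            clusters.append([row])
    return clusters
```

Each resulting cluster would then be filled into its own fusion table. A production system might instead combine several target parameters (file name, header, first column) and the position information into one score, as the text allows.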
Furthermore, for a fusion table obtained in the above other embodiment, an LDA model (Latent Dirichlet Allocation, a document topic generation model) may be used to extract a topic phrase from the key region text to serve as its table title.
Finally, it should be noted that the present solution may also perform a normalization operation on each index in the fusion table. Specifically, the similarity between a newly filled index and the standard index names (standard names for short) can be calculated, and the name of the index is unified according to that similarity and the numerical type of the index, so that indexes can be quickly retrieved, compared, and analyzed.
It should be understood that, through the normalization operation described above, each index is constrained to be named by its standard name.
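As a rough illustration of the normalization operation, the sketch below maps an index name to the closest entry in a standard index library. The library contents, the cutoff of 0.6, and the use of difflib's similarity are assumptions for illustration only; the check on the numerical type of the index mentioned in the text is omitted here.

```python
from difflib import get_close_matches

# Hypothetical standard index library; a real one would come from the database.
STANDARD_NAMES = ["Operating income", "Net profit", "Total assets"]


def normalize_index(name, standard_names=STANDARD_NAMES, cutoff=0.6):
    """Return the most similar standard name, or the original name when
    no standard name reaches the similarity cutoff."""
    matches = get_close_matches(name, standard_names, n=1, cutoff=cutoff)
    return matches[0] if matches else name
```

Leaving unmatched names unchanged is one possible design choice; such names could equally be flagged for manual review.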
In view of the above, the scheme provides the following innovations:
(1) The key region text of a table is acquired by a bidirectional index positioning method and associated with the table body information; this provides identification features for subsequent table fusion and also ensures a unified comparison standard for the fused tables.
(2) Two methods are provided for extracting the theme information of a table: rules and entity recognition.
(3) Multi-source heterogeneous tables with different table titles are fused, using parameters such as the theme information and position information of the table as identification features, and stored in a database in a two-dimensional table structure.
(4) Multi-source heterogeneous table data are fused according to the table position information, key region similarity, similarity of the header and/or first column of the table data, file name similarity, and the like, and stored in a database.
(5) Indexes are standardized and unified using a standard index library and index similarity, which supports quick retrieval and comparative analysis of the indexes.
(6) An automated framework is provided for turning the fused table data in multi-source heterogeneous PDF documents into reusable resources, assets, and value.
In summary, the method locates each initial table in its text list through bidirectional index positioning, and extracts the key region text of the table using empirical parameters for the key region of the table. It formulates extraction rules or entity recognition model algorithms for the table title, the measurement unit, and the currency unit. The tables parsed and converted from PDF documents of multiple sources are stored in a database as a two-dimensional list organized by attributes, position information, theme information, and body data, which associates and matches the theme information of each table with its body information. Multi-source heterogeneous standard tables are fused using parameters such as the theme information and position information of the table as identification features, while abnormal tables are fused according to the table position information, key region similarity, similarity of the header and/or first column of the table data, and file name similarity. The fused tables are stored in the database in a two-dimensional table structure to support rapid retrieval and comparative analysis.
Corresponding to the above-mentioned method for processing the table data fusion for the PDF document, an embodiment of the present disclosure further provides a device for processing the table data fusion for the PDF document, as shown in fig. 7, where the device may include:
A parsing unit 702, configured to parse a plurality of PDF documents to obtain a plurality of initial tables and multiple pages of text content contained in each PDF document.
A conversion unit 704, configured to convert the multi-page text content corresponding to each PDF document into a plurality of text lists, where a single text list includes a plurality of lines of text.
A selecting unit 706, configured to select, from the plurality of text lists corresponding to each PDF document, a target text list corresponding to the page on which each initial table included in the PDF document is located.
A determining unit 708, configured to determine fusion parameters of the respective initial tables included in the plurality of PDF documents.
The determining unit 708 includes:
a matching sub-module 7082, configured to, for any first initial table among the initial tables, determine a target position of the first initial table in the corresponding first target text list at least by matching the first row of the first initial table with each line of text in the first target text list;
an extraction sub-module 7084, configured to extract, from the first target text list, the key region text of the first initial table at least according to the target position;
a determination sub-module 7086, configured to determine the fusion parameters of the first initial table based at least on the key region text.
A storage unit 710, configured to store the fusion parameters and the data content of each initial table as a data row in an intermediate table.
A fusion unit 712, configured to fuse the data content in each data row based on the fusion parameters in each data row in the intermediate table, so as to obtain a fusion result.
In one embodiment, the matching submodule 7082 is specifically configured to:
match each item of content in the first row of the first initial table with each line of text in the first target text list to determine the first position number corresponding to each item of content in the first row, the first position numbers forming a first position array;
when the first position numbers form a continuous number sequence, calculate a first statistic based on the first position numbers, and take the first statistic as the target position;
when the first position numbers form a discontinuous number sequence, determine a second position array based on any row of the first initial table, determine a target subarray from the first position array based on the second position array, and calculate a second statistic based on the first position numbers in the target subarray as the target position.
In one embodiment, the matching submodule 7082 is further specifically configured to:
split the first position array into a plurality of subarrays, wherein the first position numbers in each subarray form a continuous number sequence;
and determine, as the target subarray, the subarray among the plurality of subarrays whose average value is closest to, and smaller than, the second average value of the second position array.
In one embodiment, the first statistic or the second statistic includes any one of the following: maximum, minimum, mode, mean and median.
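The matching procedure described above can be sketched as follows. This is only an illustration under stated assumptions: matching is taken to be a substring test, position numbers are zero-based line indices in the text list, the second position array is built from a row supplied by the caller, and the minimum is used as the default statistic. The function names are hypothetical.

```python
from statistics import mean


def match_positions(row_cells, text_lines):
    """Sorted line indices of every text line containing some cell of the row
    (substring matching is an assumption)."""
    return sorted({i for i, line in enumerate(text_lines)
                   for cell in row_cells if cell and cell in line})


def split_runs(positions):
    """Split sorted positions into subarrays of consecutive numbers."""
    runs = []
    for p in positions:
        if runs and p == runs[-1][-1] + 1:
            runs[-1].append(p)
        else:
            runs.append([p])
    return runs


def target_position(first_row, other_row, text_lines, stat=min):
    """Locate a table in the text list; returns None when nothing can be matched."""
    first = match_positions(first_row, text_lines)
    if not first:
        return None
    runs = split_runs(first)
    if len(runs) == 1:
        # Continuous sequence: the statistic of the first position array is the target.
        return stat(first)
    second = match_positions(other_row, text_lines)
    if not second:
        return None
    second_mean = mean(second)
    # Keep subarrays whose mean lies before the second array, then take the closest.
    candidates = [r for r in runs if mean(r) < second_mean]
    if not candidates:
        return None
    target = min(candidates, key=lambda r: second_mean - mean(r))
    return stat(target)
```

The second array disambiguates the case where header text such as "Item" also appears elsewhere on the page: the subarray lying just before the matched body row is taken to be the real header.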
In one embodiment, the target location is a start line; the extraction sub-module 7084 is specifically configured to:
extract, from the first target text list, a preset number of lines of text starting from the start line, as the key region text of the first initial table.
In one embodiment, a single initial table is used to record a certain class of subject indexes and index values, and the fusion parameters include a table title, where the table title indicates an index class;
the fusion unit 712 is specifically configured to:
grouping the data content in each data row according to the index category corresponding to each data row;
filling the index and index value contained in the data content of each group into the corresponding fusion table, so as to realize the fusion of each initial table according to the group, thereby obtaining a plurality of fusion tables;
a plurality of fusion tables are determined as fusion results.
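The title-based grouping above can be sketched in a few lines. The data row layout used here (keys table_title, file_name, and content as a list of index/value pairs) is a hypothetical simplification of the intermediate table, not the structure defined by the specification.

```python
from collections import defaultdict


def fuse_by_title(data_rows):
    """Group data rows by the index category given in the table title and
    fill each group's indexes and index values into one fusion table."""
    fusion_tables = defaultdict(list)
    for row in data_rows:
        for index_name, index_value in row["content"]:
            fusion_tables[row["table_title"]].append({
                "index": index_name,
                "value": index_value,
                "source": row["file_name"],  # kept so fused values stay traceable
            })
    return dict(fusion_tables)
```

The returned dictionary holds one fusion table per index category, and together these tables form the fusion result.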
In another embodiment, a single initial table is used to record a certain class of subject indexes and index values;
the fusion unit 712 is specifically configured to:
clustering the data content in each data row based on the fusion parameters in each data row to obtain a plurality of class clusters;
filling indexes and index values contained in the data content in each class cluster into corresponding fusion tables, so that the initial tables are fused according to the class clusters, and a plurality of fusion tables are obtained;
a plurality of fusion tables are determined as fusion results.
In one embodiment, the apparatus further comprises:
a normalization unit 714, configured to perform a normalization operation on each index in the fusion table, the normalization operation being used to constrain each index to be named by its standard name.
In one embodiment, the above fusion parameters include attributes, location information, and subject information. The attributes comprise file names, enterprise names and dates, the position information comprises page numbers, table sequences, start lines and stop lines, and the theme information comprises key area texts, table titles, measurement units and currency units.
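One possible way to represent a data row of the intermediate table, with the three groups of fusion parameters listed above, is sketched below. The class and field names are hypothetical, chosen only to mirror the terms in the text; the field types are assumptions.

```python
from dataclasses import dataclass, asdict


@dataclass
class Attributes:
    file_name: str
    enterprise_name: str
    date: str


@dataclass
class PositionInfo:
    page_number: int
    table_order: int
    start_line: int
    end_line: int


@dataclass
class ThemeInfo:
    key_region_text: str
    table_title: str
    unit_of_measure: str
    currency_unit: str


@dataclass
class DataRow:
    attributes: Attributes
    position: PositionInfo
    theme: ThemeInfo
    content: list  # table body, e.g. (index, index value) pairs


def flatten(row: DataRow) -> dict:
    """Flatten a data row into one record suitable for a two-dimensional table."""
    flat = {}
    for group in ("attributes", "position", "theme"):
        flat.update(asdict(getattr(row, group)))
    flat["content"] = row.content
    return flat
```

Keeping the three parameter groups as separate types makes it easy to fuse on one group (for example the theme information) while falling back to another (for example the position information) for abnormal tables.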
The functions of the functional units of the apparatus in the foregoing embodiments of the present disclosure may be implemented by the steps of the foregoing method embodiments, so that the specific working process of the apparatus provided in one embodiment of the present disclosure is not repeated herein.
The form data fusion processing device for PDF documents provided by an embodiment of this specification can improve the fusion efficiency of multi-source heterogeneous table data.
In this specification, the embodiments are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, since the device embodiments are substantially similar to the method embodiments, they are described relatively briefly, and the relevant parts of the method embodiments may be consulted.
The steps of a method or algorithm described in connection with the disclosure herein may be embodied in hardware, or may be embodied in software instructions executed by a processor. The software instructions may consist of corresponding software modules that may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. In addition, the ASIC may reside in a server. The processor and the storage medium may also reside as discrete components in a server.
Those skilled in the art will appreciate that in one or more of the examples described above, the functions described in the present invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on, or transmitted as one or more instructions or code on, a computer-readable medium. Computer-readable media include both computer storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The foregoing detailed description of the embodiments has further described the objects, technical solutions and advantages of the present specification, and it should be understood that the foregoing description is only a detailed description of the embodiments of the present specification, and is not intended to limit the scope of the present specification, but any modifications, equivalents, improvements, etc. made on the basis of the technical solutions of the present specification should be included in the scope of the present specification.

Claims (8)

1. A form data fusion processing method for PDF documents comprises the following steps:
parsing a plurality of PDF documents to obtain a plurality of initial tables and multiple pages of text content contained in each PDF document;
converting the multi-page text content corresponding to each PDF document into a plurality of text lists, wherein a single text list comprises a plurality of lines of text;
selecting, from the plurality of text lists corresponding to each PDF document, a target text list corresponding to the page on which each initial table contained in the PDF document is located;
determining fusion parameters of each initial table contained in the plurality of PDF documents; wherein, for any first initial table among the initial tables, determining the corresponding fusion parameters includes: determining a target position of the first initial table in a corresponding first target text list at least by matching a first row of the first initial table with each line of text in the first target text list; extracting key region text of the first initial table from the first target text list at least according to the target position; and determining the fusion parameters of the first initial table based at least on the key region text;
storing the fusion parameters and the data content of each initial table as a data row in an intermediate table;
based on the fusion parameters in each data row in the intermediate table, fusing the data content in each data row to obtain a fusion result;
the determining the target position of the first initial table in the first target text list includes:
matching each item of content in the first row of the first initial table with each line of text in the first target text list to determine a first position number corresponding to each item of content in the first row, the first position numbers forming a first position array;
in the case where the first position numbers form a continuous number sequence, calculating a first statistic based on the first position numbers and taking the first statistic as the target position;
in the case where the first position numbers form a discontinuous number sequence, determining a second position array based on any row of the first initial table, determining a target subarray from the first position array based on the second position array, and calculating a second statistic based on the first position numbers in the target subarray as the target position;
the fusion parameters comprise attributes, position information and theme information, wherein the attributes comprise file names, enterprise names and dates; the position information comprises page numbers, table sequences, start lines and end lines; and the theme information comprises key region text, table title, unit of measure, and unit of currency.
2. The method of claim 1, wherein the determining a target subarray from the first position array comprises:
splitting the first position array into a plurality of subarrays, wherein the first position numbers in each subarray form a continuous number sequence;
and determining, as the target subarray, the subarray among the plurality of subarrays whose average value is closest to, and smaller than, the second average value of the second position array.
3. The method of claim 1, wherein the first statistic or the second statistic comprises any one of: maximum, minimum, mode, mean and median.
4. The method of claim 1, wherein the target position is a start line;
the extracting the key region text of the first initial table from the first target text list comprises the following step:
extracting, from the first target text list, a preset number of lines of text starting from the start line, as the key region text of the first initial table.
5. The method of claim 1, wherein a single initial table is used to record a certain class of subject indexes and index values; the fusion parameters comprise a table title, wherein the table title indicates an index category;
the fusing the data content in each data row comprises the following steps:
grouping the data content in each data row according to the index category corresponding to each data row;
filling the index and index value contained in the data content of each group into a corresponding fusion table, so as to realize the fusion of the initial tables according to the groups, thereby obtaining a plurality of fusion tables;
and determining the fusion tables as the fusion results.
6. The method of claim 1, wherein a single initial table is used to record a certain class of subject indexes and index values;
the fusing the data content in each data row comprises the following steps:
clustering the data content in each data row based on the fusion parameters in each data row to obtain a plurality of class clusters;
filling indexes and index values contained in the data content in each class cluster into corresponding fusion tables, so as to realize fusion of the initial tables according to the class clusters, and further obtain a plurality of fusion tables;
and determining the fusion tables as the fusion results.
7. The method of claim 1, further comprising:
performing a normalization operation on each index in the fusion table, wherein the normalization operation is used to constrain each index to be named by its standard name.
8. A form data fusion processing device for PDF documents comprises:
a parsing unit, configured to parse a plurality of PDF documents to obtain a plurality of initial tables and multiple pages of text content contained in each PDF document;
a conversion unit, configured to convert a plurality of pages of text content corresponding to each PDF document into a plurality of text lists, where a single text list includes a plurality of lines of text;
a selecting unit, configured to select, from a plurality of text lists corresponding to each PDF document, a target text list corresponding to a page where each initial table included in the PDF document is located;
a determining unit configured to determine fusion parameters of respective initial tables included in the plurality of PDF documents;
the determination unit includes:
a matching sub-module, configured to, for any first initial table among the initial tables, determine a target position of the first initial table in a corresponding first target text list at least by matching a first row of the first initial table with each line of text in the first target text list;
an extraction sub-module, configured to extract key region text of the first initial table from the first target text list at least according to the target position;
a determining sub-module, configured to determine the fusion parameters of the first initial table based at least on the key region text;
a storage unit, configured to store the fusion parameters and the data content of each initial table as a data row in an intermediate table;
a fusion unit, configured to fuse the data content in each data row based on the fusion parameters in each data row in the intermediate table, to obtain a fusion result;
the matching sub-module is specifically configured to:
match each item of content in the first row of the first initial table with each line of text in the first target text list to determine a first position number corresponding to each item of content in the first row, the first position numbers forming a first position array;
in the case where the first position numbers form a continuous number sequence, calculate a first statistic based on the first position numbers and take the first statistic as the target position;
in the case where the first position numbers form a discontinuous number sequence, determine a second position array based on any row of the first initial table, determine a target subarray from the first position array based on the second position array, and calculate a second statistic based on the first position numbers in the target subarray as the target position;
the fusion parameters comprise attributes, position information and theme information, wherein the attributes comprise file names, enterprise names and dates; the position information comprises page numbers, table sequences, start lines and end lines; and the theme information comprises key region text, table title, unit of measure, and unit of currency.
CN202410002584.XA 2024-01-02 2024-01-02 PDF document-oriented form data fusion processing method and device Active CN117496545B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410002584.XA CN117496545B (en) 2024-01-02 2024-01-02 PDF document-oriented form data fusion processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410002584.XA CN117496545B (en) 2024-01-02 2024-01-02 PDF document-oriented form data fusion processing method and device

Publications (2)

Publication Number Publication Date
CN117496545A CN117496545A (en) 2024-02-02
CN117496545B true CN117496545B (en) 2024-03-15

Family

ID=89669459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410002584.XA Active CN117496545B (en) 2024-01-02 2024-01-02 PDF document-oriented form data fusion processing method and device

Country Status (1)

Country Link
CN (1) CN117496545B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20130096004A (en) * 2012-02-21 2013-08-29 한국과학기술원 Automatic table classification method and system based on information in table within document
CN110795919A (en) * 2019-11-07 2020-02-14 达而观信息科技(上海)有限公司 Method, device, equipment and medium for extracting table in PDF document
CN113961685A (en) * 2021-07-13 2022-01-21 北京金山数字娱乐科技有限公司 Information extraction method and device
WO2022105172A1 (en) * 2020-11-17 2022-05-27 平安科技(深圳)有限公司 Pdf document cross-page table merging method and apparatus, electronic device and storage medium
CN114821612A (en) * 2022-05-30 2022-07-29 浙商期货有限公司 Method and system for extracting information of PDF document in securities future scene
WO2023138023A1 (en) * 2022-01-18 2023-07-27 深圳前海环融联易信息科技服务有限公司 Multimodal document information extraction method based on graph neural network, device and medium
CN116975626A (en) * 2023-06-09 2023-10-31 浙江大学 Automatic updating method and device for supply chain data model
CN117095419A (en) * 2023-08-25 2023-11-21 上海数珩信息科技股份有限公司 PDF document data processing and information extracting device and method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10242257B2 (en) * 2017-05-18 2019-03-26 Wipro Limited Methods and devices for extracting text from documents
CN108470021B (en) * 2018-03-26 2022-06-03 阿博茨德(北京)科技有限公司 Method and device for positioning table in PDF document
CN108446264B (en) * 2018-03-26 2022-02-15 阿博茨德(北京)科技有限公司 Method and device for analyzing table vector in PDF document

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"The ITER Remote Maintenance Management System"; Tesini, A.; Rolfe, A. C.; Fusion Engineering and Design; 2009-06-30; Vol. 84 (No. 2-6); full text *
"Web table data extraction method based on multi-feature fusion"; 马佳芸, 杨林峰; 《工业控制计算机》; 2022-11-30; Vol. 35 (No. 11); full text *
"Recognition and extraction of table information in PDF documents"; 田翠华, 张一平, 胡志钢, 高静敏, 李西雨; 《厦门理工学院学报》; 2020-06-30 (No. 03); full text *
Tesini, A (Tesini, Alessandro) *

Also Published As

Publication number Publication date
CN117496545A (en) 2024-02-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant