WO2022105172A1 - Method, apparatus, electronic device and storage medium for merging cross-page tables in PDF documents - Google Patents

Method, apparatus, electronic device and storage medium for merging cross-page tables in PDF documents

Info

Publication number
WO2022105172A1
Authority
WO
WIPO (PCT)
Prior art keywords
page
position information
data
cross
deep learning
Prior art date
Application number
PCT/CN2021/096636
Other languages
English (en)
French (fr)
Inventor
王文浩
徐国强
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2022105172A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/174 Form filling; Merging
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Definitions

  • The present application relates to the technical field of text processing in artificial intelligence, and in particular to a method, apparatus, electronic device and storage medium for merging cross-page tables in PDF documents.
  • The PDF format is widely used to store and transmit files, and information often has to be extracted from PDF documents. Tables appear frequently in such documents, but the inventor found that the PDF format has no table structure of its own: a table obtained by parsing a PDF document consists only of text and drawn lines. When a table appears at the bottom of one page and another table appears at the top of the next page, it is necessary to judge whether the two belong to the same table.
  • Existing cross-page table merging in PDF documents mainly relies on rules, such as checking whether the tables on the two adjacent pages contain the same number of columns. For complex tables that span pages, rule-based methods do not judge well.
  • A first aspect of the present application provides a method for merging cross-page tables in a PDF document, the method comprising:
  • acquiring at least two PDF documents containing tables, collecting position information and text information of at least one table in each of the PDF documents, and obtaining a table data set according to the position information of the tables;
  • for each table in the table data set, randomly selecting a row at which to divide the table, and obtaining the position information of the upper half block and the position information of the lower half block of each table; merging the position information of the upper half block of each table with the position information of the lower half block of the same table to obtain positive sample data, and marking the positive sample data with a first mark; randomly selecting the position information of the upper half block of each table and the position information of the upper half block of another table to obtain negative sample data, and marking the negative sample data with a second mark; the positive sample data and the negative sample data form sample training data, and the sample training data and the corresponding marks form a cross-page table training data set;
  • constructing a deep learning model based on a pre-trained deep bidirectional Transformer model, constructing the input data of the deep learning model from the cross-page table training data set, using each cell of the tables in the cross-page table training data set as one time step of the input of the deep learning model, using the annotated binary classification prediction value corresponding to each item of sample training data in the cross-page table training data set as the output of the deep learning model, and training and optimizing the deep learning model to obtain a table merging model;
  • obtaining a PDF test document, collecting the text information and position information of each page in the PDF test document, removing the text information and position information of the header and footer of each page, and determining from the position information of each page whether there is a table at the bottom and at the top of the page; when there is a table at the bottom of a page and a table at the top of the next page, merging the position information of the table at the bottom of the page with the position information of the table at the top of the next page, and using the merged result as cross-page table test data;
  • using the table merging model to predict a binary classification prediction value from the cross-page table test data, the binary classification prediction value being used to determine whether the cross-page table test data needs to be merged;
  • when it is determined that the cross-page table test data needs to be merged, merging the table at the bottom of the page and the table at the top of the next page to obtain a result table, and displaying the result table according to an instruction.
  • A second aspect of the present application provides an electronic device comprising a memory and a processor, the memory being used to store at least one computer-readable instruction, and the processor being configured to execute the at least one computer-readable instruction to implement the following steps:
  • acquiring at least two PDF documents containing tables, collecting position information and text information of at least one table in each of the PDF documents, and obtaining a table data set according to the position information of the tables;
  • for each table in the table data set, randomly selecting a row at which to divide the table, and obtaining the position information of the upper half block and the position information of the lower half block of each table; merging the position information of the upper half block of each table with the position information of the lower half block of the same table to obtain positive sample data, and marking the positive sample data with a first mark; randomly selecting the position information of the upper half block of each table and the position information of the upper half block of another table to obtain negative sample data, and marking the negative sample data with a second mark; the positive sample data and the negative sample data form sample training data, and the sample training data and the corresponding marks form a cross-page table training data set;
  • constructing a deep learning model based on a pre-trained deep bidirectional Transformer model, constructing the input data of the deep learning model from the cross-page table training data set, using each cell of the tables in the cross-page table training data set as one time step of the input of the deep learning model, using the annotated binary classification prediction value corresponding to each item of sample training data in the cross-page table training data set as the output of the deep learning model, and training and optimizing the deep learning model to obtain a table merging model;
  • obtaining a PDF test document, collecting the text information and position information of each page in the PDF test document, removing the text information and position information of the header and footer of each page, and determining from the position information of each page whether there is a table at the bottom and at the top of the page; when there is a table at the bottom of a page and a table at the top of the next page, merging the position information of the table at the bottom of the page with the position information of the table at the top of the next page, and using the merged result as cross-page table test data;
  • using the table merging model to predict a binary classification prediction value from the cross-page table test data, the binary classification prediction value being used to determine whether the cross-page table test data needs to be merged;
  • when it is determined that the cross-page table test data needs to be merged, merging the table at the bottom of the page and the table at the top of the next page to obtain a result table, and displaying the result table according to an instruction.
  • A third aspect of the present application provides a computer-readable storage medium storing at least one computer-readable instruction which, when executed by a processor, implements the following steps:
  • acquiring at least two PDF documents containing tables, collecting position information and text information of at least one table in each of the PDF documents, and obtaining a table data set according to the position information of the tables;
  • for each table in the table data set, randomly selecting a row at which to divide the table, and obtaining the position information of the upper half block and the position information of the lower half block of each table; merging the position information of the upper half block of each table with the position information of the lower half block of the same table to obtain positive sample data, and marking the positive sample data with a first mark; randomly selecting the position information of the upper half block of each table and the position information of the upper half block of another table to obtain negative sample data, and marking the negative sample data with a second mark; the positive sample data and the negative sample data form sample training data, and the sample training data and the corresponding marks form a cross-page table training data set;
  • constructing a deep learning model based on a pre-trained deep bidirectional Transformer model, constructing the input data of the deep learning model from the cross-page table training data set, using each cell of the tables in the cross-page table training data set as one time step of the input of the deep learning model, using the annotated binary classification prediction value corresponding to each item of sample training data in the cross-page table training data set as the output of the deep learning model, and training and optimizing the deep learning model to obtain a table merging model;
  • obtaining a PDF test document, collecting the text information and position information of each page in the PDF test document, removing the text information and position information of the header and footer of each page, and determining from the position information of each page whether there is a table at the bottom and at the top of the page; when there is a table at the bottom of a page and a table at the top of the next page, merging the position information of the table at the bottom of the page with the position information of the table at the top of the next page, and using the merged result as cross-page table test data;
  • using the table merging model to predict a binary classification prediction value from the cross-page table test data, the binary classification prediction value being used to determine whether the cross-page table test data needs to be merged;
  • when it is determined that the cross-page table test data needs to be merged, merging the table at the bottom of the page and the table at the top of the next page to obtain a result table, and displaying the result table according to an instruction.
  • A fourth aspect of the present application provides an apparatus for merging cross-page tables in a PDF document, the apparatus comprising:
  • a table data acquisition module, configured to acquire at least two PDF documents containing tables, collect position information and text information of at least one table in each of the PDF documents, and obtain a table data set according to the position information of the tables;
  • a training data set construction module, configured to, for each table in the table data set, randomly select a row at which to divide the table and obtain the position information of the upper half block and the position information of the lower half block of each table; merge the position information of the upper half block of each table with the position information of the lower half block of the same table to obtain positive sample data, and mark the positive sample data with a first mark; randomly select the position information of the upper half block of each table and the position information of the upper half block of another table to obtain negative sample data, and mark the negative sample data with a second mark; the positive sample data and the negative sample data form sample training data, and the sample training data and the corresponding marks form a cross-page table training data set;
  • a model training module, configured to construct a deep learning model based on a pre-trained deep bidirectional Transformer model, construct the input data of the deep learning model from the cross-page table training data set, use each cell of the tables in the cross-page table training data set as one time step of the input of the deep learning model, use the annotated binary classification prediction value corresponding to each item of sample training data in the cross-page table training data set as the output of the deep learning model, and train and optimize the deep learning model to obtain a table merging model;
  • a test data construction module, configured to obtain a PDF test document, collect the text information and position information of each page in the PDF test document, remove the text information and position information of the header and footer of each page, and determine from the position information of each page whether there is a table at the bottom and at the top of the page; when there is a table at the bottom of a page and a table at the top of the next page, merge the position information of the table at the bottom of the page with the position information of the table at the top of the next page, and use the merged result as cross-page table test data;
  • a prediction module, configured to use the table merging model to predict a binary classification prediction value from the cross-page table test data, the binary classification prediction value being used to determine whether the cross-page table test data needs to be merged;
  • a merging module, configured to, when it is determined that the cross-page table test data needs to be merged, merge the table at the bottom of the page and the table at the top of the next page to obtain a result table, and display the result table according to an instruction.
  • In the present application, at least two PDF documents are acquired and at least one table in each of them is collected to obtain a table data set; a cross-page table training data set is generated from the table data set; a deep learning model is trained on the cross-page table training data set to obtain a table merging model; a PDF test document is obtained, its headers and footers are removed, and cross-page table test data is constructed; the table merging model predicts, for the cross-page table test data, a binary classification prediction value indicating whether merging is needed; whether the cross-page table test data needs to be merged is judged from that value; and the cross-page tables that need to be merged are merged and output. This effectively handles the task of extracting complex tables that span pages in a PDF document, with high accuracy in judging whether cross-page tables need to be merged.
  • FIG. 1 is a flowchart of a method for merging tables across pages in a PDF document according to an embodiment of the present application.
  • FIG. 2 is a structural diagram of an apparatus for merging tables across pages in a PDF document according to an embodiment of the present application.
  • FIG. 3 is a schematic diagram of an electronic device in an embodiment of the present application.
  • the method for merging tables across pages of a PDF document of the present application is applied in one or more electronic devices.
  • The electronic device is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, etc.
  • the electronic device may be a computing device such as a desktop computer, a notebook computer, a tablet computer, and a cloud server.
  • the device can perform human-computer interaction with the user through a keyboard, a mouse, a remote control, a touch pad, or a voice-activated device.
  • FIG. 1 is a flowchart of a method for merging tables across pages in a PDF document in an embodiment of the present application. According to different requirements, the order of the steps in the flowchart can be changed, and some steps can be omitted.
  • the method for merging tables across pages in a PDF document specifically includes the following steps:
  • Step S11: acquire at least two PDF documents containing tables, collect position information and text information of at least one table in each of the PDF documents, and obtain a table data set according to the position information of the tables.
  • collecting location information and text information of at least one table in each of the PDF documents, and obtaining a table data set according to the location information of the table includes:
  • The PDF documents may relate to different fields, such as finance, business or medicine, and may record various types of information. The text information is all text information other than pictures; the position information includes the position information of headers, footers, titles, body text, tables, etc.
  • Step S12: for each table in the table data set, randomly select a row at which to divide the table, and obtain the position information of the upper half block and the position information of the lower half block of each table; merge the position information of the upper half block of each table with the position information of the lower half block of the same table to obtain positive sample data, and mark the positive sample data with a first mark; randomly select the position information of the upper half block of each table and the position information of the upper half block of another table to obtain negative sample data, and mark the negative sample data with a second mark; the positive sample data and the negative sample data form sample training data, and the sample training data and the corresponding marks form a cross-page table training data set.
  • The first mark may be 1 and the second mark may be 0.
  • Generating a cross-page table training data set according to the table data set includes:
  • for the first table, randomly selecting a row other than the first row and the last row at which to divide the first table, and obtaining the position information of the upper half block of the first table and the position information of the lower half block of the first table;
  • for the second table, randomly selecting a row other than the first row and the last row at which to divide the second table, and obtaining the position information of the upper half block of the second table and the position information of the lower half block of the second table, where the upper half block and the lower half block are the upper part and the lower part of a table obtained after the table is divided;
  • obtaining the second negative sample data from the position information of the upper half block of a table, and marking the first negative sample data and the second negative sample data as 0;
  • the first positive sample data, the second positive sample data, the first negative sample data and the second negative sample data form the sample training data, and the sample training data and the corresponding marks form the cross-page table training data set.
  • The position information of a block includes: the x coordinate of the upper-left corner of the block, the y coordinate of the upper-left corner of the block, the width of the block, the height of the block, the x coordinate of the upper-left corner of each cell, the y coordinate of the upper-left corner of each cell, the width of each cell, the height of each cell, and the number of columns in the block.
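The sample construction of Step S12 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the names `split_table` and `build_samples` are hypothetical, a table is represented simply as a list of rows, and the sketch assumes that the randomly selected row becomes the first row of the lower half block (the text does not say which half receives it).

```python
import random
from typing import Any, List, Sequence, Tuple

Block = List[Any]                  # rows belonging to one half of a table
Sample = Tuple[Block, Block, int]  # (first block, second block, mark)


def split_table(rows: Sequence[Any], rng: random.Random) -> Tuple[Block, Block]:
    """Divide a table at a random row that is neither the first nor the last."""
    if len(rows) < 3:
        raise ValueError("need at least 3 rows to split away from both ends")
    k = rng.randrange(1, len(rows) - 1)  # k is in 1 .. len(rows) - 2
    # Assumption: the selected row becomes the first row of the lower block.
    return list(rows[:k]), list(rows[k:])


def build_samples(tables: Sequence[Sequence[Any]], seed: int = 0) -> List[Sample]:
    """Build positive (mark 1) and negative (mark 0) samples from the tables."""
    if len(tables) < 2:
        raise ValueError("negative samples need at least two tables")
    rng = random.Random(seed)
    halves = [split_table(t, rng) for t in tables]
    samples: List[Sample] = []
    for i, (upper, lower) in enumerate(halves):
        # Positive sample: both halves come from the same table.
        samples.append((upper, lower, 1))
        # Negative sample: this upper half paired with another table's upper half.
        j = rng.choice([k for k in range(len(halves)) if k != i])
        samples.append((upper, halves[j][0], 0))
    return samples
```

In a real pipeline each row would carry the block and cell position information listed above rather than opaque values.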
  • Step S13: construct a deep learning model based on a pre-trained deep bidirectional Transformer model, construct the input data of the deep learning model from the cross-page table training data set, use each cell of the tables in the cross-page table training data set as one time step of the input of the deep learning model, use the annotated binary classification prediction value corresponding to each item of sample training data in the cross-page table training data set as the output of the deep learning model, and train and optimize the deep learning model to obtain a table merging model.
  • Constructing the input data of the deep learning model according to the cross-page table training data set includes:
  • constructing the sample training data in the cross-page table training data set, together with its marks, as data conforming to the model input format, and using it as the input data of the deep learning model. The model input format is [SEP] + table1_cell1 + table1_cell2 + ... + table1_cellm + [SEP] + table2_cell1 + table2_cell2 + ... + table2_celln + [SEP], where table1 and table2 represent two blocks, table_cell represents the feature composed of the position information of a cell in a block, m represents the number of cells in table1, and n represents the number of cells in table2.
  • [SEP] is a sequence composed of m "1"s, except that when m is less than n, [SEP] is a sequence composed of n "1"s. The data in table_cell is [x_t, y_t, w_t, h_t, x_t+w_t, y_t+h_t, (x_t+w_t)/h_t, (y_t+h_t)/2, x_c, y_c, w_c, h_c, x_c+w_c, y_c+h_c, (x_c+w_c)/h_c, (y_c+h_c)/2, a], where x_t is the x coordinate of the upper-left corner of the block, y_t is the y coordinate of the upper-left corner of the block, w_t is the width of the block, h_t is the height of the block, x_c is the x coordinate of the upper-left corner of the cell, y_c is the y coordinate of the upper-left corner of the cell, w_c is the width of the cell, and h_c is the height of the cell.
  • table1 and table2 may represent the upper half block and the lower half block of the same table, or the upper half blocks of two different tables. When table1 and table2 have the same number of columns, a is 1; when they have different numbers of columns, a is 0.
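The input format described above can be illustrated with a short sketch. The names `Box`, `cell_feature` and `build_input` are hypothetical, and the sketch follows the text literally: each cell becomes a 17-element feature, and [SEP] is a run of "1"s whose length is the larger of m and n. How these sequences are padded or embedded before reaching the model is not stated in the text and is left out here.

```python
from typing import List, NamedTuple, Sequence


class Box(NamedTuple):
    x: float  # x coordinate of the upper-left corner
    y: float  # y coordinate of the upper-left corner
    w: float  # width
    h: float  # height


def cell_feature(block: Box, cell: Box, same_columns: bool) -> List[float]:
    """Build the 17-element table_cell feature for one cell of a block."""
    x_t, y_t, w_t, h_t = block
    x_c, y_c, w_c, h_c = cell
    a = 1.0 if same_columns else 0.0
    # Heights are assumed to be non-zero; a zero height would divide by zero.
    return [
        x_t, y_t, w_t, h_t, x_t + w_t, y_t + h_t, (x_t + w_t) / h_t, (y_t + h_t) / 2,
        x_c, y_c, w_c, h_c, x_c + w_c, y_c + h_c, (x_c + w_c) / h_c, (y_c + h_c) / 2,
        a,
    ]


def build_input(block1: Box, cells1: Sequence[Box], columns1: int,
                block2: Box, cells2: Sequence[Box], columns2: int) -> List[List[float]]:
    """Return [SEP] + cells of block 1 + [SEP] + cells of block 2 + [SEP]."""
    same = columns1 == columns2
    m, n = len(cells1), len(cells2)
    sep = [1.0] * max(m, n)  # m ones, or n ones when m is less than n
    seq: List[List[float]] = [list(sep)]
    seq += [cell_feature(block1, c, same) for c in cells1]
    seq.append(list(sep))
    seq += [cell_feature(block2, c, same) for c in cells2]
    seq.append(list(sep))
    return seq
```

Each element of the returned list corresponds to one time step of the model input, so a pair of blocks with m and n cells yields m + n + 3 steps.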
  • Using the annotated binary classification prediction value corresponding to each item of sample training data in the cross-page table training data set as the output of the deep learning model includes:
  • when the sample training data is positive sample data, the binary classification prediction value at [SEP] in the output of the deep learning model is a first preset value;
  • when the sample training data is negative sample data, the binary classification prediction value at [SEP] in the output of the deep learning model is a second preset value.
  • The first preset value may be 1, and the second preset value may be 0.
  • The binary classification prediction value is the probability that the two blocks in the sample training data come from the same table. When the value is the first preset value, the two blocks come from the same table and the probability that the sample training data needs to be merged is 1, that is, the sample training data needs to be merged; when the value is the second preset value, the two blocks come from different tables and the probability that the sample training data needs to be merged is 0, that is, the sample training data does not need to be merged.
  • Training and optimizing the deep learning model to obtain a table merging model includes:
  • training the prediction layer until it converges to obtain the table merging model, the output of which is a binary classification prediction value for predicting whether the sample training data needs to be merged.
  • Because the binary classification prediction value may be any value between 0 and 1, whether given sample training data needs to be merged can be determined against a preset comparison value of 0.5: when the value is greater than or equal to 0.5, it is determined that the sample training data needs to be merged; when it is less than 0.5, it is determined that the sample training data does not need to be merged.
  • When the binary classification prediction value predicted by the table merging model for the sample training data is greater than or equal to 0.5, the probability that the two blocks in the sample training data come from the same table is greater than or equal to 0.5, so it can be determined that the two blocks need to be merged; when the predicted value is less than 0.5, the probability that the two blocks come from the same table is less than 0.5, so it can be determined that the two blocks do not need to be merged.
  • Step S14: obtain a PDF test document, collect the text information and position information of each page in the PDF test document, remove the text information and position information of the header and footer of each page, and determine from the position information of each page whether there is a table at the bottom and at the top of the page; when there is a table at the bottom of a page and a table at the top of the next page, merge the position information of the table at the bottom of the page with the position information of the table at the top of the next page, and use the merged result as cross-page table test data.
  • Removing the text information and position information of the header and footer of each page in the PDF test document includes:
  • taking the region whose height is a first quantile value of the average page height as the candidate area of the header, and the region whose height is a second quantile value of the average page height as the candidate area of the footer;
  • when the first edit distance is less than a preset first threshold, determining that the text in the candidate area is a header, and removing the text information and position information of the header;
  • when the second edit distance is less than a preset second threshold, determining that the text in the candidate area is a footer, and removing the text information and position information of the footer.
  • The edit distance is a quantitative measure of the difference between two character strings: it is the minimum number of operations, such as insertion, modification and deletion, required to convert one string into the other.
  • When extracting the header of the PDF test document, calculate the mean page height h_mean of all pages in the PDF test document and take the top one-fifth of h_mean as the candidate area of the header. For each page in the PDF test document, extract the text information and position information in the header candidate area, and calculate the edit distance between the text in the header candidate area of the page and the text in the candidate areas of the 3 pages before and the 3 pages after it. Content whose edit distance is less than the first threshold is determined to be a header, and the text information and position information of the header are removed.
  • When extracting the footer of the PDF test document, calculate the mean page height h_mean of all pages in the PDF test document and take the bottom one-fifth of h_mean as the candidate area of the footer. For each page in the PDF test document, extract the text information and position information in the footer candidate area, and calculate the edit distance between the text in the footer candidate area of the page and the text in the candidate areas of the 3 pages before and the 3 pages after it. Content whose edit distance is less than the second threshold is determined to be a footer, and the text information and position information of the footer are removed.
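The header and footer detection can be sketched as below. The function names are hypothetical, the y coordinate is assumed to be measured from the top of the page, and two points the text leaves open are settled by assumption: a page's candidate text is treated as repeated if any neighbouring page within the window is closer than the threshold, and empty candidate text is never flagged.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def candidate_bands(page_heights: List[float]) -> Tuple[float, float]:
    """Return (header_bottom, footer_top) from the mean page height h_mean.

    Assumption: y grows downward, so text with y < header_bottom is a header
    candidate and text with y > footer_top is a footer candidate.
    """
    h_mean = sum(page_heights) / len(page_heights)
    return h_mean / 5, h_mean - h_mean / 5


def repeated_across_pages(texts: List[str], threshold: int, window: int = 3) -> List[bool]:
    """Flag pages whose candidate text nearly repeats on a nearby page."""
    flags: List[bool] = []
    for i, text in enumerate(texts):
        lo, hi = max(0, i - window), min(len(texts), i + window + 1)
        neighbours = [texts[j] for j in range(lo, hi) if j != i]
        flags.append(text != "" and any(
            n != "" and edit_distance(text, n) < threshold for n in neighbours))
    return flags
```

The same `repeated_across_pages` call serves both cases: pass the header candidate texts with the first threshold, or the footer candidate texts with the second threshold.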
  • Step S15: use the table merging model to predict a binary classification prediction value from the cross-page table test data, the binary classification prediction value being used to determine whether the cross-page table test data needs to be merged.
  • Using the table merging model to predict a binary classification prediction value includes:
  • the table merging model predicting, from the input data, a binary classification prediction value indicating whether the table at the bottom of the page and the table at the top of the next page in the cross-page table test data need to be merged.
  • When the binary classification prediction value predicted by the table merging model for the cross-page table test data is greater than or equal to 0.5, the table at the bottom of the page and the table at the top of the next page belong to the same table, so it is judged that they need to be merged; when the predicted value is less than 0.5, the table at the bottom of the page and the table at the top of the next page belong to different tables, so it is judged that they do not need to be merged.
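The decision rule applied to the model output is small enough to state directly. The function name `needs_merge` is hypothetical; the 0.5 comparison value and the "greater than or equal" boundary come from the text.

```python
def needs_merge(prediction: float, threshold: float = 0.5) -> bool:
    """Return True when the predicted probability says the two tables are one.

    `prediction` is the binary classification prediction value in [0, 1];
    values at or above the threshold mean the tables should be merged.
    """
    if not 0.0 <= prediction <= 1.0:
        raise ValueError("prediction must be a probability between 0 and 1")
    return prediction >= threshold
```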
  • Step S16: when it is determined that the cross-page table test data needs to be merged, merge the table at the bottom of the page and the table at the top of the next page to obtain a result table, and display the result table according to an instruction.
  • Merging the table at the bottom of the page and the table at the top of the next page to obtain a result table, and displaying the result table according to an instruction, includes:
  • displaying the result table.
  • Merging the table at the bottom of the page and the table at the top of the next page, according to the extracted position information of the table at the bottom of the page and the position information of the table at the top of the next page, to obtain the result table can include:
  • combining the width-adjusted picture of the table at the bottom of the page with the picture of the table at the top of the next page to obtain the result table.
  • Alternatively, merging the table at the bottom of the page and the table at the top of the next page to obtain the result table can include:
  • merging the text information in the table at the bottom of the page with the corresponding text information in the table at the top of the next page to obtain the result table.
  • displaying the result table may include:
  • extracting the result table from the database and scaling it to the page size of the document while preserving its height-to-width ratio, so that the height of the result table is smaller than the height of the document page and its width is smaller than the width of the document page, and displaying the result table on one page of the document.
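The scaling step described above can be illustrated with a minimal sketch. It assumes the result table and the page are described only by their widths and heights; the function name `fit_to_page` and the `margin` parameter are hypothetical and are not part of the application.

```python
from typing import Tuple


def fit_to_page(table_w: float, table_h: float,
                page_w: float, page_h: float,
                margin: float = 0.95) -> Tuple[float, float]:
    """Scale a table uniformly so that it is strictly smaller than the page.

    `margin` (an assumption) keeps the table slightly inside the page so
    that both dimensions end up smaller than the page dimensions.
    """
    if min(table_w, table_h, page_w, page_h) <= 0:
        raise ValueError("all dimensions must be positive")
    # A single factor for both axes preserves the height-to-width ratio.
    # The factor is capped at 1.0 so small tables are never enlarged.
    scale = min(1.0, page_w * margin / table_w, page_h * margin / table_h)
    return table_w * scale, table_h * scale


# Example: a tall merged table shown on an A4-sized page (in points).
width, height = fit_to_page(800, 1200, 595, 842)
```

Using one scale factor for both axes is what keeps the table undistorted; whether small tables should also be enlarged is not stated in the text, so the sketch leaves them unchanged.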
  • the data and output results produced during processing can be stored in a blockchain, for example the face image training data, the first feature map, the first geometric relationship matrix, the face picture test data, the second input data, and the face key points.
  • This application obtains at least two PDF documents and collects at least one table from each of them to obtain a table data set; generates a cross-page table training data set from the table data set; trains a deep learning model on the cross-page table training data set to obtain a table merging model; obtains a PDF test document, removes its headers and footers, and constructs cross-page table test data; uses the table merging model to predict a binary classification prediction value for the cross-page table test data; determines from that value whether the cross-page table test data needs to be merged; and merges and outputs the cross-page tables that need to be merged. The application can effectively handle the extraction of complex tables that span pages in PDF documents, and determines with high accuracy whether cross-page tables need to be merged.
  • FIG. 2 is a structural diagram of an apparatus 30 for merging tables in PDF documents according to an embodiment of the present application.
  • the PDF document cross-page table merging apparatus 30 runs in an electronic device.
  • the PDF document cross-page table merging apparatus 30 may include a plurality of functional modules composed of program code segments.
  • the program codes of each program segment in the PDF document cross-page table merging apparatus 30 may be stored in the memory and executed by at least one processor to perform the PDF document cross-page table merging function.
  • the PDF document cross-page table merging apparatus 30 may be divided into a plurality of functional modules according to the functions performed by the apparatus 30 .
  • the PDF document cross-page table merging apparatus 30 may include a table data acquisition module 301, a training data set construction module 302, a model training module 303, a test data construction module 304, a prediction module 305 and a merging module 306.
  • a module referred to in this application is a series of computer-readable instruction segments that are stored in a memory, can be executed by at least one processor, and perform a fixed function. In some embodiments, the functions of each module are described in detail in the following embodiments.
  • the table data acquisition module 301 obtains at least two PDF documents containing tables, collects position information and text information of at least one table in each of the PDF documents, and obtains a table data set according to the position information of the tables.
  • collecting, by the table data acquisition module 301, position information and text information of at least one table in each of the PDF documents, and obtaining a table data set according to the position information of the tables, includes:
  • parsing each PDF document with the pdfplumber library to obtain the position information and text information of the PDF document, and collecting from the position information the position information of the tables in the PDF document and of each cell in the tables as the table data set.
  • the PDF document may be a document that records various types of information and relates to different fields, for example the financial, commercial and medical fields; the text information is all text information other than pictures, and the position information includes the position information of headers, footers, titles, body text, tables, etc.
  • the training data set construction module 302, for each table in the table data set, randomly selects a row of the table at which to split it, obtaining position information of the upper half block and of the lower half block of each table; merges the position information of the upper half block and the lower half block of each table to obtain positive sample data, and labels the positive sample data with a first label; randomly pairs the position information of the upper half block of each table with the position information of the upper half block of another table to obtain negative sample data, and labels the negative sample data with a second label; the positive sample data and the negative sample data form sample training data, and the sample training data and the corresponding labels form a cross-page table training data set.
  • for example, the first label may be 1 and the second label may be 0.
  • for another example, when the table data set includes a first table and a second table, generating the cross-page table training data set from the table data set by the training data set construction module 302 includes:
  • for the first table, randomly selecting a row other than the first row and the last row at which to split the table, obtaining position information of the upper half block and of the lower half block of the first table; for the second table, randomly selecting a row other than the first row and the last row at which to split the table, obtaining position information of the upper half block and of the lower half block of the second table; the upper half block and the lower half block are the upper part and the lower part of a table obtained after the split;
  • merging the position information of the upper half block and the lower half block of the first table to obtain first positive sample data, merging the position information of the upper half block and the lower half block of the second table to obtain second positive sample data, and labeling the first and second positive sample data as 1;
  • merging the position information of the upper half block of the first table and the upper half block of the second table to obtain first negative sample data, merging the position information of the upper half block of the second table and the upper half block of the first table to obtain second negative sample data, and labeling the first and second negative sample data as 0;
  • the first positive sample data, the second positive sample data, the first negative sample data and the second negative sample data form the sample training data, and the sample training data and the corresponding labels form the cross-page table training data set.
  • the position information of a block includes: the x coordinate of the upper left corner of the block, the y coordinate of the upper left corner of the block, the width of the block, the height of the block, the x coordinate of the upper left corner of a cell, the y coordinate of the upper left corner of the cell, the width of the cell, the height of the cell, and the number of columns in the block.
  • the model training module 303 constructs a deep learning model based on a pre-trained deep bidirectional transformer model, constructs the input data of the deep learning model according to the cross-page table training data set, uses each cell of each table in the cross-page table training data set as one time step of the input of the deep learning model, uses the binary classification prediction value corresponding to the label of each sample training data in the cross-page table training data set as the output of the deep learning model, and trains and optimizes the deep learning model to obtain a table merging model.
  • constructing the input data of the deep learning model according to the cross-page table training data set includes:
  • constructing the sample training data and their labels in the cross-page table training data set into data that conforms to the model input format, and using the data as the input data of the deep learning model, where the model input format is [SEP] + table1_cell1 + table1_cell2 + ... + table1_cellm + [SEP] + table2_cell1 + table2_cell2 + ... + table2_celln + [SEP]; table1 and table2 denote two blocks, table_cell denotes the feature composed of the position information of a cell in a block, m is the number of cells in table1, and n is the number of cells in table2.
  • when m is greater than or equal to n, [SEP] is a sequence of m "1"s; when m is less than n, [SEP] is a sequence of n "1"s. The data in a table_cell is [x_t, y_t, w_t, h_t, x_t+w_t, y_t+h_t, (x_t+w_t)/h_t, (y_t+h_t)/2, x_c, y_c, w_c, h_c, x_c+w_c, y_c+h_c, (x_c+w_c)/h_c, (y_c+h_c)/2, a], where x_t is the x coordinate of the upper left corner of the block, y_t is the y coordinate of the upper left corner of the block, w_t is the width of the block, h_t is the height of the block, x_c is the x coordinate of the upper left corner of the cell, y_c is the y coordinate of the upper left corner of the cell, w_c is the width of the cell, h_c is the height of the cell, and a is 0 or 1.
  • table1 and table2 may represent the upper half block and the lower half block of the same table, or the two upper half blocks of different tables; a is 1 when table1 and table2 have the same number of columns, and 0 when they have different numbers of columns.
  • using the binary classification prediction value corresponding to the label of each sample training data in the cross-page table training data set as the output of the deep learning model includes:
  • when the label of the sample training data is the first label, the binary classification prediction value at [SEP] in the output of the deep learning model is a first preset value;
  • when the label of the sample training data is the second label, the binary classification prediction value at [SEP] in the output of the deep learning model is a second preset value.
  • for example, the first preset value may be 1, and the second preset value may be 0.
  • the binary classification prediction value is the probability that the two blocks in the sample training data come from the same table. When the binary classification prediction value is the first preset value, the two blocks in the sample training data come from the same table and the probability that the sample training data needs to be merged is 1, that is, the sample training data needs to be merged; when the binary classification prediction value is the second preset value, the two blocks come from different tables and the probability that the sample training data needs to be merged is 0, that is, the sample training data does not need to be merged.
  • training and optimizing the deep learning model to obtain the table merging model includes:
  • encoding the input data with the encoding layer;
  • training the prediction layer until it converges to obtain the table merging model, whose output is a binary classification prediction value predicting whether the sample training data needs to be merged.
  • when the binary classification prediction value is any value between 0 and 1, whether a given piece of sample training data needs to be merged can be determined against a preset comparison value of 0.5: when the value is greater than or equal to 0.5, it is determined that the sample training data needs to be merged; when the value is less than 0.5, it is determined that the sample training data does not need to be merged.
  • the test data construction module 304 obtains a PDF test document, collects the text information and position information of each page in the PDF test document, removes the text information and position information of the header and footer of each page, and determines according to the position information of each page whether a table exists at the bottom and at the top of each page; when a table exists at the bottom of a page and at the top of the next page, it merges the position information of the table at the bottom of the page with the position information of the table at the top of the next page, and uses the merged result as the cross-page table test data.
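A minimal sketch of how candidate cross-page pairs might be collected is given here. The text does not say how "a table at the bottom of the page" is decided, so the rule used here is an assumption: after header and footer removal, a table is at the bottom of a page if no other content lies below it, and at the top if no other content lies above it. The item dictionaries and all function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical page item: {"kind": "table" or "text", "top": float, "bottom": float},
# with y measured downward from the top of the page.
Item = Dict


def bottom_table(items: List[Item]) -> Optional[Item]:
    """Return the lowest item on the page if it is a table, else None."""
    if not items:
        return None
    lowest = max(items, key=lambda it: it["bottom"])
    return lowest if lowest["kind"] == "table" else None


def top_table(items: List[Item]) -> Optional[Item]:
    """Return the highest item on the page if it is a table, else None."""
    if not items:
        return None
    highest = min(items, key=lambda it: it["top"])
    return highest if highest["kind"] == "table" else None


def cross_page_pairs(pages: List[List[Item]]) -> List[Tuple[int, Item, Item]]:
    """Pair the bottom table of page i with the top table of page i + 1."""
    pairs = []
    for i in range(len(pages) - 1):
        bottom, top = bottom_table(pages[i]), top_table(pages[i + 1])
        if bottom is not None and top is not None:
            pairs.append((i, bottom, top))
    return pairs


pages = [
    [{"kind": "text", "top": 50, "bottom": 100},
     {"kind": "table", "top": 120, "bottom": 780}],
    [{"kind": "table", "top": 40, "bottom": 300},
     {"kind": "text", "top": 320, "bottom": 700}],
    [{"kind": "text", "top": 40, "bottom": 700}],
]
pairs = cross_page_pairs(pages)  # one candidate pair, spanning pages 0 and 1
```

Each returned pair would then be converted to the model input format and passed to the table merging model.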
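  • removing the text information and position information of the header and footer of each page in the PDF test document includes:
  • calculating the mean page height of all pages in the PDF test document;
  • taking the region whose height is a first quantile of the mean page height as the candidate area of the header, and the region whose height is a second quantile of the mean page height as the candidate area of the footer;
  • for each page in the PDF test document, extracting the text information and position information in the candidate area of the header and in the candidate area of the footer;
  • for each page in the PDF test document, calculating a first edit distance between the text in the header candidate area of the page and the text in the header candidate areas of a specified number of pages before and after it, and calculating a second edit distance between the text in the footer candidate area of the page and the text in the footer candidate areas of a specified number of pages before and after it;
  • when the first edit distance is less than a preset first threshold, determining that the text in the candidate area is a header and removing the text information and position information of the header; when the second edit distance is less than a preset second threshold, determining that the text in the candidate area is a footer and removing the text information and position information of the footer.
  • the edit distance is a quantitative measure of the difference between two character strings, namely the minimum number of operations such as insertion, modification and deletion required to convert one character string into the other.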
  • the prediction module 305 uses the table merging model to predict a binary classification prediction value from the cross-page table test data; the binary classification prediction value is used to determine whether the cross-page table test data needs to be merged.
  • using the table merging model to predict the binary classification prediction value includes:
  • converting the cross-page table test data into the format required by the table merging model, and using the converted cross-page table test data as the input data of the table merging model;
  • predicting, by the table merging model and from the input data, a binary classification prediction value indicating whether the table at the bottom of the page and the table at the top of the next page in the cross-page table test data need to be merged.
  • when the binary classification prediction value that the table merging model predicts for the cross-page table test data is greater than or equal to 0.5, the table at the bottom of the page and the table at the top of the next page belong to the same table, so it is determined that the two tables need to be merged; when the binary classification prediction value is less than 0.5, the two tables belong to different tables, so it is determined that they do not need to be merged.
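  • the merging module 306, when it is determined that the cross-page table test data needs to be merged, merges the table at the bottom of the page with the table at the top of the next page to obtain a result table, and displays the result table according to an instruction.
  • merging the table at the bottom of the page and the table at the top of the next page to obtain a result table, and displaying the result table according to an instruction, includes:
  • merging the table at the bottom of the page with the table at the top of the next page according to the extracted position information of the two tables to obtain the result table;
  • storing the result table as a table file, and storing the complete result table;
  • when the instruction is received, displaying the result table.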
  • This application obtains at least two PDF documents and collects at least one table from each of them to obtain a table data set; generates a cross-page table training data set from the table data set; trains a deep learning model on the cross-page table training data set to obtain a table merging model; obtains a PDF test document, removes its headers and footers, and constructs cross-page table test data; uses the table merging model to predict a binary classification prediction value for the cross-page table test data; determines from that value whether the cross-page table test data needs to be merged; and merges and outputs the cross-page tables that need to be merged. The application can effectively handle the extraction of complex tables that span pages in PDF documents, and determines with high accuracy whether cross-page tables need to be merged.
  • FIG. 3 is a schematic diagram of an electronic device 6 in an embodiment of the present application.
  • the electronic device 6 includes a memory 61 , a processor 62 and computer readable instructions stored in the memory 61 and executable on the processor 62 .
  • when the processor 62 executes the computer-readable instructions, the steps in the above embodiments of the PDF document cross-page table merging method are implemented, for example steps S11 to S16 shown in FIG. 1.
  • alternatively, when the processor 62 executes the computer-readable instructions, the functions of the modules/units in the above embodiments of the PDF document cross-page table merging apparatus are implemented, for example modules 301 to 306 in FIG. 2.
  • the computer-readable instructions may be divided into one or more modules/units, which are stored in the memory 61 and executed by the processor 62 to carry out the present application.
  • the one or more modules/units may be a series of computer-readable instruction segments capable of accomplishing specific functions, and the instruction segments are used to describe the execution process of the computer-readable instructions in the electronic device 6.
  • for example, the computer-readable instructions can be divided into the table data acquisition module 301, the training data set construction module 302, the model training module 303, the test data construction module 304, the prediction module 305 and the merging module 306 in FIG. 2; for the specific functions of each module, refer to Embodiment 2.
  • the electronic device 6 may be a computing device such as a desktop computer, a notebook, a palmtop computer, a server, or a cloud terminal device.
  • the schematic diagram is only an example of the electronic device 6 and does not limit it; the electronic device 6 may include more or fewer components than shown, combine some components, or use different components. For example, the electronic device 6 may also include input and output devices, network access devices, buses, and the like.
  • the processor 62 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor, or the processor 62 may be any conventional processor.
  • the processor 62 is the control center of the electronic device 6 and uses various interfaces and lines to connect the parts of the entire electronic device 6.
  • the memory 61 may be used to store the computer-readable instructions and/or modules/units; the processor 62 realizes the various functions of the electronic device 6 by running or executing the computer-readable instructions and/or modules/units stored in the memory 61 and by calling the data stored in the memory 61.
  • the memory 61 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required for at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created through the use of the electronic device 6.
  • the memory 61 may include volatile memory and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another storage device.
  • if the modules/units integrated in the electronic device 6 are implemented as software functional modules and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments may be implemented by computer-readable instructions that instruct the relevant hardware; the computer-readable instructions may be stored in a computer-readable storage medium and, when executed by a processor, implement the steps of the above method embodiments. The computer-readable instructions include computer-readable instruction code, which may be in source code form, object code form, an executable file, or some intermediate form.
  • the computer-readable medium may include any entity or device capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, computer memory, read-only memory (ROM), random access memory (RAM), etc.
  • the computer-readable storage medium described in this application may be non-volatile or volatile.
  • a blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each data block contains a batch of network transaction information that is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • each functional module in each embodiment of the present application may be integrated in the same processing module, or each module may exist physically alone, or two or more modules may be integrated in the same module.
  • the above-mentioned integrated modules can be implemented in the form of hardware, or can be implemented in the form of hardware plus software function modules.


Abstract

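The present application relates to the technical field of artificial intelligence and provides a method, apparatus, electronic device and storage medium for merging cross-page tables in PDF documents. The method includes: obtaining at least two PDF documents and collecting at least one table from each of the PDF documents to obtain a table data set; generating a cross-page table training data set from the table data set; training a deep learning model on the cross-page table training data set to obtain a table merging model; obtaining a PDF test document, removing its headers and footers, and constructing cross-page table test data; using the table merging model to predict a binary classification prediction value indicating whether the cross-page table test data needs to be merged; determining from the binary classification prediction value whether the cross-page table test data needs to be merged; and merging and outputting the cross-page tables that need to be merged. The application can effectively handle the extraction of complex tables that span pages in PDF documents, and determines with high accuracy whether cross-page tables need to be merged.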

Description

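Method, Apparatus, Electronic Device and Storage Medium for Merging Cross-Page Tables in PDF Documents
This application claims priority to Chinese patent application No. 202011290521.7, filed with the China Patent Office on November 17, 2020 and entitled "PDF文档跨页表格合并方法、装置、电子设备及存储介质" (Method, Apparatus, Electronic Device and Storage Medium for Merging Cross-Page Tables in PDF Documents), the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of text processing in artificial intelligence, and in particular to a method, apparatus, electronic device and storage medium for merging cross-page tables in PDF documents.
Background
The PDF format is widely used for storing and transmitting all kinds of documents, and information often needs to be extracted from PDF documents. Tables appear frequently in PDF documents, but the inventors found that, because the PDF format has no table structure, the tables obtained by parsing a PDF document consist only of text and drawn lines. When tables appear both at the bottom of one page and at the top of the next page, it must be determined whether they are the same table. In the prior art, cross-page table merging in PDF documents mainly relies on rules that check whether the two tables on either side of the page break have the same number of columns; for complex cross-page tables, such rule-based methods do not give good results.
Summary
In view of the above, it is necessary to provide a method, apparatus, electronic device and storage medium for merging cross-page tables in PDF documents, so as to determine whether complex cross-page tables need to be merged.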
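A first aspect of the present application provides a method for merging cross-page tables in PDF documents, the method including:
obtaining at least two PDF documents that contain tables, collecting position information and text information of at least one table in each of the PDF documents, and obtaining a table data set according to the position information of the tables;
for each table in the table data set, randomly selecting a row of the table at which to split it, obtaining position information of the upper half block and of the lower half block of each table; merging the position information of the upper half block and the lower half block of each table to obtain positive sample data, and labeling the positive sample data with a first label; randomly pairing the position information of the upper half block of each table with the position information of the upper half block of another table to obtain negative sample data, and labeling the negative sample data with a second label; the positive sample data and the negative sample data forming sample training data, and the sample training data and the corresponding labels forming a cross-page table training data set;
constructing a deep learning model based on a pre-trained deep bidirectional transformer model, constructing input data of the deep learning model according to the cross-page table training data set, using each cell of each table in the cross-page table training data set as one time step of the input of the deep learning model, using the binary classification prediction value corresponding to the label of each sample training data in the cross-page table training data set as the output of the deep learning model, and training and optimizing the deep learning model to obtain a table merging model;
obtaining a PDF test document, collecting text information and position information of each page in the PDF test document, removing the text information and position information of the header and footer of each page, determining according to the position information of each page whether a table exists at the bottom and at the top of each page, and, when a table exists at the bottom of a page and at the top of the next page, merging the position information of the table at the bottom of the page with the position information of the table at the top of the next page, the merged result serving as cross-page table test data;
predicting a binary classification prediction value from the cross-page table test data using the table merging model, the binary classification prediction value being used to determine whether the cross-page table test data needs to be merged;
when it is determined that the cross-page table test data needs to be merged, merging the table at the bottom of the page with the table at the top of the next page to obtain a result table, and displaying the result table according to an instruction.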
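A second aspect of the present application provides an electronic device including a memory and a processor, the memory being configured to store at least one computer-readable instruction and the processor being configured to execute the at least one computer-readable instruction to implement the following steps:
obtaining at least two PDF documents that contain tables, collecting position information and text information of at least one table in each of the PDF documents, and obtaining a table data set according to the position information of the tables;
for each table in the table data set, randomly selecting a row of the table at which to split it, obtaining position information of the upper half block and of the lower half block of each table; merging the position information of the upper half block and the lower half block of each table to obtain positive sample data, and labeling the positive sample data with a first label; randomly pairing the position information of the upper half block of each table with the position information of the upper half block of another table to obtain negative sample data, and labeling the negative sample data with a second label; the positive sample data and the negative sample data forming sample training data, and the sample training data and the corresponding labels forming a cross-page table training data set;
constructing a deep learning model based on a pre-trained deep bidirectional transformer model, constructing input data of the deep learning model according to the cross-page table training data set, using each cell of each table in the cross-page table training data set as one time step of the input of the deep learning model, using the binary classification prediction value corresponding to the label of each sample training data in the cross-page table training data set as the output of the deep learning model, and training and optimizing the deep learning model to obtain a table merging model;
obtaining a PDF test document, collecting text information and position information of each page in the PDF test document, removing the text information and position information of the header and footer of each page, determining according to the position information of each page whether a table exists at the bottom and at the top of each page, and, when a table exists at the bottom of a page and at the top of the next page, merging the position information of the table at the bottom of the page with the position information of the table at the top of the next page, the merged result serving as cross-page table test data;
predicting a binary classification prediction value from the cross-page table test data using the table merging model, the binary classification prediction value being used to determine whether the cross-page table test data needs to be merged;
when it is determined that the cross-page table test data needs to be merged, merging the table at the bottom of the page with the table at the top of the next page to obtain a result table, and displaying the result table according to an instruction.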
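A third aspect of the present application provides a computer-readable storage medium storing at least one computer-readable instruction which, when executed by a processor, implements the following steps:
obtaining at least two PDF documents that contain tables, collecting position information and text information of at least one table in each of the PDF documents, and obtaining a table data set according to the position information of the tables;
for each table in the table data set, randomly selecting a row of the table at which to split it, obtaining position information of the upper half block and of the lower half block of each table; merging the position information of the upper half block and the lower half block of each table to obtain positive sample data, and labeling the positive sample data with a first label; randomly pairing the position information of the upper half block of each table with the position information of the upper half block of another table to obtain negative sample data, and labeling the negative sample data with a second label; the positive sample data and the negative sample data forming sample training data, and the sample training data and the corresponding labels forming a cross-page table training data set;
constructing a deep learning model based on a pre-trained deep bidirectional transformer model, constructing input data of the deep learning model according to the cross-page table training data set, using each cell of each table in the cross-page table training data set as one time step of the input of the deep learning model, using the binary classification prediction value corresponding to the label of each sample training data in the cross-page table training data set as the output of the deep learning model, and training and optimizing the deep learning model to obtain a table merging model;
obtaining a PDF test document, collecting text information and position information of each page in the PDF test document, removing the text information and position information of the header and footer of each page, determining according to the position information of each page whether a table exists at the bottom and at the top of each page, and, when a table exists at the bottom of a page and at the top of the next page, merging the position information of the table at the bottom of the page with the position information of the table at the top of the next page, the merged result serving as cross-page table test data;
predicting a binary classification prediction value from the cross-page table test data using the table merging model, the binary classification prediction value being used to determine whether the cross-page table test data needs to be merged;
when it is determined that the cross-page table test data needs to be merged, merging the table at the bottom of the page with the table at the top of the next page to obtain a result table, and displaying the result table according to an instruction.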
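A fourth aspect of the present application provides an apparatus for merging cross-page tables in PDF documents, the apparatus including:
a table data acquisition module, configured to obtain at least two PDF documents that contain tables, collect position information and text information of at least one table in each of the PDF documents, and obtain a table data set according to the position information of the tables;
a training data set construction module, configured to, for each table in the table data set, randomly select a row of the table at which to split it, obtaining position information of the upper half block and of the lower half block of each table; merge the position information of the upper half block and the lower half block of each table to obtain positive sample data, and label the positive sample data with a first label; randomly pair the position information of the upper half block of each table with the position information of the upper half block of another table to obtain negative sample data, and label the negative sample data with a second label; the positive sample data and the negative sample data forming sample training data, and the sample training data and the corresponding labels forming a cross-page table training data set;
a model training module, configured to construct a deep learning model based on a pre-trained deep bidirectional transformer model, construct input data of the deep learning model according to the cross-page table training data set, use each cell of each table in the cross-page table training data set as one time step of the input of the deep learning model, use the binary classification prediction value corresponding to the label of each sample training data in the cross-page table training data set as the output of the deep learning model, and train and optimize the deep learning model to obtain a table merging model;
a test data construction module, configured to obtain a PDF test document, collect text information and position information of each page in the PDF test document, remove the text information and position information of the header and footer of each page, determine according to the position information of each page whether a table exists at the bottom and at the top of each page, and, when a table exists at the bottom of a page and at the top of the next page, merge the position information of the table at the bottom of the page with the position information of the table at the top of the next page, the merged result serving as cross-page table test data;
a prediction module, configured to predict a binary classification prediction value from the cross-page table test data using the table merging model, the binary classification prediction value being used to determine whether the cross-page table test data needs to be merged;
a merging module, configured to, when it is determined that the cross-page table test data needs to be merged, merge the table at the bottom of the page with the table at the top of the next page to obtain a result table, and display the result table according to an instruction.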
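In the present application, at least two PDF documents are obtained and at least one table is collected from each of them to obtain a table data set; a cross-page table training data set is generated from the table data set; a deep learning model is trained on the cross-page table training data set to obtain a table merging model; a PDF test document is obtained, its headers and footers are removed, and cross-page table test data is constructed; the table merging model predicts a binary classification prediction value indicating whether the cross-page table test data needs to be merged; whether the cross-page table test data needs to be merged is determined from that value; and the cross-page tables that need to be merged are merged and output. The application can effectively handle the extraction of complex tables that span pages in PDF documents, and determines with high accuracy whether cross-page tables need to be merged.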
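Brief Description of the Drawings
FIG. 1 is a flowchart of a method for merging cross-page tables in PDF documents according to an embodiment of the present application.
FIG. 2 is a structural diagram of an apparatus for merging cross-page tables in PDF documents according to an embodiment of the present application.
FIG. 3 is a schematic diagram of an electronic device according to an embodiment of the present application.
Detailed Description
To make the above objects, features and advantages of the present application easier to understand, the application is described in detail below with reference to the drawings and specific embodiments. It should be noted that, provided they do not conflict, the embodiments of the present application and the features in the embodiments may be combined with one another.
Many specific details are set forth in the following description to facilitate a full understanding of the present application. The described embodiments are only some, not all, of the embodiments of the application. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in this application without creative effort fall within the scope of protection of the application.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of the present application. The terms used in the specification are for the purpose of describing specific embodiments only and are not intended to limit the application.
Preferably, the method for merging cross-page tables in PDF documents is applied in one or more electronic devices. The electronic device is a device that can automatically perform numerical computation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
The electronic device may be a computing device such as a desktop computer, a notebook computer, a tablet computer or a cloud server. The device can interact with a user through a keyboard, a mouse, a remote control, a touchpad, a voice control device, or the like.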
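Embodiment 1
FIG. 1 is a flowchart of a method for merging cross-page tables in PDF documents according to an embodiment of the present application. Depending on requirements, the order of the steps in the flowchart may be changed, and some steps may be omitted.
Referring to FIG. 1, the method for merging cross-page tables in PDF documents includes the following steps:
Step S11: obtaining at least two PDF documents that contain tables, collecting position information and text information of at least one table in each of the PDF documents, and obtaining a table data set according to the position information of the tables.
Specifically, in at least one embodiment of the present application, collecting position information and text information of at least one table in each of the PDF documents and obtaining a table data set according to the position information of the tables includes:
parsing each PDF document with the pdfplumber library to obtain the position information and text information of the PDF document, and collecting from the position information the position information of the tables in the PDF document and of each cell in the tables as the table data set.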
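A minimal sketch of this collection step with pdfplumber follows. It assumes the default table-detection settings are adequate and that a table is recorded by its bounding box, its cell boxes and its extracted text; the function names, the record layout and the conversion of pdfplumber's `(x0, top, x1, bottom)` boxes to `(x, y, width, height)` are illustrative choices, not the application's implementation.

```python
from typing import Dict, List, Tuple

# pdfplumber reports boxes as (x0, top, x1, bottom), measured from the
# top-left corner of the page.
BBox = Tuple[float, float, float, float]


def bbox_to_xywh(bbox: BBox) -> Tuple[float, float, float, float]:
    """Convert (x0, top, x1, bottom) to (x, y, width, height)."""
    x0, top, x1, bottom = bbox
    return (x0, top, x1 - x0, bottom - top)


def collect_tables(pdf_path: str) -> List[Dict]:
    """Collect position and text information for every table in a PDF."""
    import pdfplumber  # third-party; imported lazily so bbox_to_xywh has no dependency

    records = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages):
            for table in page.find_tables():
                records.append({
                    "page": page_number,
                    "table": bbox_to_xywh(table.bbox),
                    "cells": [bbox_to_xywh(cell) for cell in table.cells],
                    "text": table.extract(),  # rows of cell strings
                })
    return records
```

Calling `collect_tables("report.pdf")` (a hypothetical file name) would return one record per detected table; documents with borderless tables may need non-default table settings.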
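Specifically, in other embodiments of the present application, collecting position information and text information of at least one table in each of the PDF documents and obtaining a table data set according to the position information of the tables includes:
parsing each PDF document with another PDF content parsing library, for example pdfminer or camelot, to obtain the position information and text information of the PDF document, and collecting from the position information the position information and text information of the tables in the PDF document and the position information of each cell in the tables.
In one embodiment of the present application, the PDF document may be a document that records various types of information and relates to different fields, for example the financial, commercial and medical fields; the text information is all text information other than pictures, and the position information includes the position information of headers, footers, titles, body text, tables, etc.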
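Step S12: for each table in the table data set, randomly selecting a row of the table at which to split it, obtaining position information of the upper half block and of the lower half block of each table; merging the position information of the upper half block and the lower half block of each table to obtain positive sample data, and labeling the positive sample data with a first label; randomly pairing the position information of the upper half block of each table with the position information of the upper half block of another table to obtain negative sample data, and labeling the negative sample data with a second label; the positive sample data and the negative sample data form sample training data, and the sample training data and the corresponding labels form a cross-page table training data set.
For example, the first label may be 1 and the second label may be 0.
For another example, when the table data set includes a first table and a second table, generating the cross-page table training data set from the table data set includes:
for the first table, randomly selecting a row other than the first row and the last row at which to split the table, obtaining position information of the upper half block and of the lower half block of the first table; for the second table, randomly selecting a row other than the first row and the last row at which to split the table, obtaining position information of the upper half block and of the lower half block of the second table; the upper half block and the lower half block are the upper part and the lower part of a table obtained after the split;
merging the position information of the upper half block and the lower half block of the first table to obtain first positive sample data, merging the position information of the upper half block and the lower half block of the second table to obtain second positive sample data, and labeling the first and second positive sample data as 1;
merging the position information of the upper half block of the first table and the upper half block of the second table to obtain first negative sample data, merging the position information of the upper half block of the second table and the upper half block of the first table to obtain second negative sample data, and labeling the first and second negative sample data as 0;
the first positive sample data, the second positive sample data, the first negative sample data and the second negative sample data form the sample training data, and the sample training data and the corresponding labels form the cross-page table training data set.
In one embodiment of the present application, the position information of a block includes: the x coordinate of the upper left corner of the block, the y coordinate of the upper left corner of the block, the width of the block, the height of the block, the x coordinate of the upper left corner of a cell, the y coordinate of the upper left corner of the cell, the width of the cell, the height of the cell, and the number of columns in the block.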
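The sample construction in step S12 can be sketched as follows. The sketch assumes each table is a list of rows and each row a list of cell boxes `(x, y, w, h)`; it also assumes that the selected row becomes the first row of the lower half block and that tables with fewer than three rows cannot be split, neither of which is specified in the text. All names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Cell = Tuple[float, float, float, float]  # (x, y, w, h)
Row = List[Cell]
Block = List[Row]
Sample = Tuple[Block, Block, int]


def split_table(rows: Sequence[Row], rng: random.Random) -> Tuple[Block, Block]:
    """Split a table at a random row that is neither the first nor the last."""
    if len(rows) < 3:
        raise ValueError("need at least three rows to avoid the first and last row")
    k = rng.randrange(1, len(rows) - 1)  # 1 <= k <= len(rows) - 2
    # Assumption: the selected row starts the lower half block.
    return list(rows[:k]), list(rows[k:])


def build_samples(tables: Sequence[Sequence[Row]], seed: int = 0) -> List[Sample]:
    """Positive: upper and lower half of one table. Negative: two upper halves."""
    rng = random.Random(seed)
    halves = [split_table(table, rng) for table in tables]
    samples: List[Sample] = []
    for i, (upper, lower) in enumerate(halves):
        samples.append((upper, lower, 1))  # first label: same table
        others = [j for j in range(len(halves)) if j != i]
        if others:
            j = rng.choice(others)
            samples.append((upper, halves[j][0], 0))  # second label: different tables
    return samples


# Two toy tables: a 4-row, 1-column table and a 5-row, 2-column table.
t1 = [[(0.0, r * 10.0, 50.0, 10.0)] for r in range(4)]
t2 = [[(0.0, r * 12.0, 30.0, 12.0), (30.0, r * 12.0, 30.0, 12.0)] for r in range(5)]
samples = build_samples([t1, t2])
```

With exactly two tables this reproduces the four samples of the example above: two positives and two negatives that pair the upper half blocks in both orders.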
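Step S13: constructing a deep learning model based on a pre-trained deep bidirectional transformer model, constructing input data of the deep learning model according to the cross-page table training data set, using each cell of each table in the cross-page table training data set as one time step of the input of the deep learning model, using the binary classification prediction value corresponding to the label of each sample training data in the cross-page table training data set as the output of the deep learning model, and training and optimizing the deep learning model to obtain a table merging model.
In at least one embodiment of the present application, constructing the input data of the deep learning model according to the cross-page table training data set includes:
constructing the sample training data and their labels in the cross-page table training data set into data that conforms to the model input format, and using the data as the input data of the deep learning model, where the model input format is [SEP] + table1_cell1 + table1_cell2 + ... + table1_cellm + [SEP] + table2_cell1 + table2_cell2 + ... + table2_celln + [SEP]; table1 and table2 denote two blocks, table_cell denotes the feature composed of the position information of a cell in a block, m is the number of cells in table1, and n is the number of cells in table2. When m is greater than or equal to n, [SEP] is a sequence of m "1"s; when m is less than n, [SEP] is a sequence of n "1"s. The data in a table_cell is [x_t, y_t, w_t, h_t, x_t+w_t, y_t+h_t, (x_t+w_t)/h_t, (y_t+h_t)/2, x_c, y_c, w_c, h_c, x_c+w_c, y_c+h_c, (x_c+w_c)/h_c, (y_c+h_c)/2, a], where x_t is the x coordinate of the upper left corner of the block, y_t is the y coordinate of the upper left corner of the block, w_t is the width of the block, h_t is the height of the block, x_c is the x coordinate of the upper left corner of the cell, y_c is the y coordinate of the upper left corner of the cell, w_c is the width of the cell, h_c is the height of the cell, and a is 0 or 1.
Specifically, table1 and table2 may represent the upper half block and the lower half block of the same table, or the two upper half blocks of different tables; a is 1 when table1 and table2 have the same number of columns, and 0 when they have different numbers of columns.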
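The input format above can be made concrete with a short sketch that builds the 17-element table_cell feature and assembles the sequence. It follows the formula in the text term by term; the function names are hypothetical, block and cell heights are assumed to be positive, and how the variable-length [SEP] entries are padded or embedded before reaching the model is not specified in the text and is left open here.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h), origin at the top left


def cell_feature(block: Box, cell: Box, same_columns: bool) -> List[float]:
    """Build the table_cell feature for one cell of a block."""
    x_t, y_t, w_t, h_t = block
    x_c, y_c, w_c, h_c = cell
    a = 1.0 if same_columns else 0.0  # 1 when both blocks have the same column count
    return [
        x_t, y_t, w_t, h_t,
        x_t + w_t, y_t + h_t, (x_t + w_t) / h_t, (y_t + h_t) / 2,
        x_c, y_c, w_c, h_c,
        x_c + w_c, y_c + h_c, (x_c + w_c) / h_c, (y_c + h_c) / 2,
        a,
    ]


def build_sequence(block1: Box, cells1: Sequence[Box], columns1: int,
                   block2: Box, cells2: Sequence[Box], columns2: int) -> List[List[float]]:
    """Assemble [SEP] + table1 cells + [SEP] + table2 cells + [SEP]."""
    same = columns1 == columns2
    # Per the text, [SEP] is a run of max(m, n) ones.
    sep = [1.0] * max(len(cells1), len(cells2))
    sequence = [sep]
    sequence += [cell_feature(block1, c, same) for c in cells1]
    sequence.append(sep)
    sequence += [cell_feature(block2, c, same) for c in cells2]
    sequence.append(sep)
    return sequence


feature = cell_feature((0, 0, 100, 50), (10, 5, 20, 10), True)
sequence = build_sequence((0, 0, 100, 50), [(0, 0, 50, 25), (50, 0, 50, 25)], 2,
                          (0, 0, 100, 25), [(0, 0, 100, 25)], 1)
```

Each element of `sequence` after a separator corresponds to one cell, which matches the statement that each cell is one time step of the model input.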
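In one embodiment of the present application, using the binary classification prediction value corresponding to the label of each sample training data in the cross-page table training data set as the output of the deep learning model includes:
when the label of the sample training data is the first label, the binary classification prediction value at [SEP] in the output of the deep learning model is a first preset value;
when the label of the sample training data is the second label, the binary classification prediction value at [SEP] in the output of the deep learning model is a second preset value.
For example, the first preset value may be 1, and the second preset value may be 0.
Specifically, in at least one embodiment of the present application, the binary classification prediction value is the probability that the two blocks in the sample training data come from the same table. When the binary classification prediction value is the first preset value, the two blocks in the sample training data come from the same table and the probability that the sample training data needs to be merged is 1, that is, the sample training data needs to be merged; when the binary classification prediction value is the second preset value, the two blocks come from different tables and the probability that the sample training data needs to be merged is 0, that is, the sample training data does not need to be merged.
In at least one embodiment of the present application, training and optimizing the deep learning model to obtain the table merging model includes:
encoding the input data with the encoding layer;
training the prediction layer until it converges to obtain the table merging model, whose output is a binary classification prediction value predicting whether the sample training data needs to be merged.
Further, in other embodiments of the present application, when the binary classification prediction value is any value between 0 and 1, whether a given piece of sample training data needs to be merged can be determined against a preset comparison value of 0.5: when the value is greater than or equal to 0.5, it is determined that the sample training data needs to be merged; when the value is less than 0.5, it is determined that the sample training data does not need to be merged.
For example, when the binary classification prediction value that the table merging model predicts from the sample training data is greater than or equal to 0.5, the probability that the two blocks in the sample training data come from the same table is greater than or equal to 0.5, so it can be determined that the two blocks need to be merged; when the predicted value is less than 0.5, the probability that the two blocks come from the same table is less than 0.5, so it can be determined that the two blocks do not need to be merged.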
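The decision rule above reduces to a single comparison, sketched here under the assumption that the model output is already a probability in [0, 1]. The function name `needs_merge` is hypothetical.

```python
def needs_merge(probability: float, threshold: float = 0.5) -> bool:
    """Return True when the predicted same-table probability reaches the threshold."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    # The boundary value itself counts as "merge", matching "greater than or equal to".
    return probability >= threshold


decisions = [needs_merge(p) for p in (0.93, 0.5, 0.49)]
```

The value 0.5 is the comparison value named in the text; exposing it as a parameter only makes the boundary behaviour easy to test.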
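Step S14: obtain a PDF test document, collect the text information and position information of each page of the PDF test document, remove the text information and position information of the header and footer of each page, determine from the position information of each page whether a table exists at the bottom and at the top of each page, and, when a table exists at the bottom of a page and at the top of the next page, combine the position information of the table at the bottom of the page with the position information of the table at the top of the next page and take the combined result as cross-page table test data.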
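The pairing of a bottom-of-page table with a top-of-next-page table can be sketched as follows. The text does not say how "bottom" and "top" are decided, so the `edge_frac` tolerance, the dictionary layout of `pages`, and all function names here are assumptions made purely for illustration; header and footer content is assumed to have been removed already.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical layout: (x, y, width, height) with y growing downward.
Box = Tuple[float, float, float, float]


def bottom_table(tables: Sequence[Box], page_height: float,
                 edge_frac: float = 0.25) -> Optional[Box]:
    """Return the lowest table if it ends within the bottom edge band."""
    if not tables:
        return None
    last = max(tables, key=lambda b: b[1] + b[3])
    return last if last[1] + last[3] >= page_height * (1 - edge_frac) else None


def top_table(tables: Sequence[Box], page_height: float,
              edge_frac: float = 0.25) -> Optional[Box]:
    """Return the highest table if it starts within the top edge band."""
    if not tables:
        return None
    first = min(tables, key=lambda b: b[1])
    return first if first[1] <= page_height * edge_frac else None


def cross_page_candidates(pages: List[Dict], edge_frac: float = 0.25
                          ) -> List[Tuple[int, Box, Box]]:
    """Pair each page's bottom table with the next page's top table."""
    pairs = []
    for i in range(len(pages) - 1):
        low = bottom_table(pages[i]["tables"], pages[i]["height"], edge_frac)
        high = top_table(pages[i + 1]["tables"], pages[i + 1]["height"], edge_frac)
        if low is not None and high is not None:
            pairs.append((i, low, high))
    return pairs
```

Each returned pair is a candidate only; whether the two tables are actually merged is decided by the table merging model.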
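In at least one embodiment of the present application, removing the text information and position information of the header and footer of each page of the PDF test document includes:
Calculating the mean page height of all pages in the PDF test document;
Taking the height at a first quantile of the mean page height as the candidate region for the header, and taking the height at a second quantile of the mean page height as the candidate region for the footer;
For each page of the PDF test document, extracting the text information and position information in the header candidate region and the text information and position information in the footer candidate region;
For each page of the PDF test document, calculating a first edit distance between the text in the header candidate region of the page and the text in the header candidate regions of a specified number of pages before and after it, and calculating a second edit distance between the text in the footer candidate region of the page and the text in the footer candidate regions of a specified number of pages before and after it;
When the first edit distance is less than a preset first threshold, determining that the text in the candidate region is a header and removing the text information and position information of the header; when the second edit distance is less than a preset second threshold, determining that the text in the candidate region is a footer and removing the text information and position information of the footer.
In at least one embodiment of the present application, the edit distance is a quantitative measure of how different two strings are. Specifically, the edit distance is the minimum number of operations, such as insertions, substitutions and deletions, needed to transform one string into the other.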
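For example, when extracting the header of the PDF test document, the mean page height h_mean of all pages in the PDF test document is calculated, and the top one-fifth of h_mean is taken as the candidate region for the header. For each page of the PDF test document, the text information and position information in the header candidate region are extracted, the edit distance between the text in the header candidate region and the text in the candidate regions of the 3 pages before and the 3 pages after is calculated, content whose edit distance is less than the first threshold is determined to be a header, and the text information and position information of the header are removed.
As another example, when extracting the footer of the PDF test document, the mean page height h_mean of all pages in the PDF test document is calculated, and the bottom one-fifth of h_mean is taken as the candidate region for the footer. For each page of the PDF test document, the text information and position information in the footer candidate region are extracted, the edit distance between the text in the footer candidate region and the text in the candidate regions of the 3 pages before and the 3 pages after is calculated, content whose edit distance is less than the second threshold is determined to be a footer, and the text information and position information of the footer are removed.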
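The header and footer detection described above can be sketched in plain Python. The one-fifth candidate regions and the window of 3 pages come from the examples in the text; the function names, the default threshold of 3, and the choice to flag a page when any neighbouring page is close enough (the text does not say whether one or all neighbours must match) are assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def candidate_regions(page_heights: Sequence[float]) -> Tuple[float, float]:
    """Return (header_limit, footer_start) from the mean page height.

    Text above header_limit is a header candidate; text below
    footer_start is a footer candidate (y grows downward).
    """
    h_mean = sum(page_heights) / len(page_heights)
    return h_mean / 5, h_mean - h_mean / 5


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def is_repeated(texts: List[str], idx: int, window: int = 3,
                threshold: int = 3) -> bool:
    """True if page idx's candidate text nearly matches a neighbouring page."""
    if not texts[idx]:
        return False  # nothing to remove on this page
    lo, hi = max(0, idx - window), min(len(texts), idx + window + 1)
    return any(
        edit_distance(texts[idx], texts[j]) < threshold
        for j in range(lo, hi) if j != idx
    )
```

The same `is_repeated` check serves for both headers and footers; only the candidate text passed in and the threshold differ.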
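Step S15: use the table merging model to predict a binary classification prediction value from the cross-page table test data, the binary classification prediction value being used to determine whether the cross-page table test data need to be merged.
In at least one embodiment of the present application, using the table merging model to predict a binary classification prediction value from the cross-page table test data includes:
Converting the cross-page table test data into the format required by the table merging model, and using the converted cross-page table test data as the input data of the table merging model;
The table merging model predicting, from the input data, a binary classification prediction value indicating whether the table at the bottom of the page and the table at the top of the next page in the cross-page table test data need to be merged.
Specifically, in at least one embodiment of the present application, when the binary classification prediction value predicted by the table merging model for the cross-page table test data is greater than or equal to 0.5, the table at the bottom of the page and the table at the top of the next page belong to the same table, so it is determined that the two tables need to be merged. When the binary classification prediction value is less than 0.5, the table at the bottom of the page and the table at the top of the next page belong to different tables, so it is determined that the two tables do not need to be merged.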
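Step S16: when it is determined that the cross-page table test data need to be merged, merge the table at the bottom of the page with the table at the top of the next page to obtain a result table, and display the result table according to an instruction.
In at least one embodiment of the present application, merging the table at the bottom of the page with the table at the top of the next page to obtain a result table, and displaying the result table according to an instruction, includes:
Merging the table at the bottom of the page with the table at the top of the next page according to the extracted position information of the two tables, to obtain the result table;
Storing the result table as a table file, and storing the complete result table;
Displaying the result table when an instruction is received.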
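For example, in one embodiment of the present application, merging the table at the bottom of the page with the table at the top of the next page according to the extracted position information of the two tables, to obtain the result table, may include:
Obtaining a picture of the table at the bottom of the page and a picture of the table at the top of the next page according to the extracted position information of the two tables;
Scaling the picture of the table at the bottom of the page and the picture of the table at the top of the next page so that the two pictures have the same width;
Joining the width-adjusted picture of the table at the bottom of the page and the picture of the table at the top of the next page to obtain the result table.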
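As another example, in another embodiment of the present application, merging the table at the bottom of the page with the table at the top of the next page according to the extracted position information of the two tables, to obtain the result table, may include:
Parsing the text information of the table at the bottom of the page and the text information of the table at the top of the next page;
Determining, according to the extracted position information of the two tables, which column of the table at the top of the next page corresponds to each column of the table at the bottom of the page;
Merging, according to the columns of the table at the bottom of the page and the corresponding columns of the table at the top of the next page, the text information of the table at the bottom of the page with the corresponding text information of the table at the top of the next page, to obtain the result table.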
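The column-aligned text merge above can be illustrated with a short sketch. The text only says that columns are matched using position information; matching each column to the one with the nearest horizontal centre is an assumption chosen for this example, as are the names `match_columns` and `merge_text` and the `(left x, width)` column layout.

```python
from typing import List, Sequence, Tuple

# Hypothetical layout: (x of the column's left edge, column width).
Column = Tuple[float, float]


def match_columns(cols_bottom: Sequence[Column],
                  cols_top: Sequence[Column]) -> List[int]:
    """For each column of the bottom-of-page table, return the index of the
    column in the top-of-next-page table whose centre is nearest."""
    def centre(col: Column) -> float:
        return col[0] + col[1] / 2

    return [
        min(range(len(cols_top)),
            key=lambda j: abs(centre(cols_top[j]) - centre(col)))
        for col in cols_bottom
    ]


def merge_text(rows_bottom: Sequence[Sequence[str]], cols_bottom: Sequence[Column],
               rows_top: Sequence[Sequence[str]], cols_top: Sequence[Column]
               ) -> List[List[str]]:
    """Append the continuation rows, reordered into the first table's columns."""
    mapping = match_columns(cols_bottom, cols_top)
    merged = [list(row) for row in rows_bottom]
    for row in rows_top:
        merged.append([row[j] for j in mapping])
    return merged
```

Here "bottom" refers to the table at the bottom of the earlier page and "top" to its continuation at the top of the next page, so the result keeps the earlier table's column order.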
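For example, displaying the result table when an instruction is received may include:
When the instruction is to present the result table as a separate one-page document, retrieving the result table from the database and, according to the page size of the document, scaling the borders and text of the result table in proportion to its height-to-width ratio so that the height of the result table is less than the height of the document and the width of the result table is less than the width of the document, and displaying the result table on one page of the document.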
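The proportional scaling described above amounts to choosing one scale factor for both dimensions. The following sketch assumes a hypothetical `margin` factor to keep the table strictly smaller than the page, and it never enlarges a table that already fits; neither detail is stated in the text.

```python
from typing import Tuple


def fit_to_page(table_w: float, table_h: float,
                page_w: float, page_h: float,
                margin: float = 0.95) -> Tuple[float, float, float]:
    """Return (scale, scaled_width, scaled_height) preserving aspect ratio.

    The scaled table is strictly smaller than the page in both dimensions.
    """
    if min(table_w, table_h, page_w, page_h) <= 0:
        raise ValueError("all dimensions must be positive")
    # The tighter of the two ratios decides the scale for both axes.
    scale = min(page_w / table_w, page_h / table_h) * margin
    scale = min(scale, 1.0)  # assumption: do not enlarge a table that fits
    return scale, table_w * scale, table_h * scale
```

Borders and text would then be drawn at `scale` times their original size so that the table keeps its proportions.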
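It should be noted that, to ensure the privacy and security of the data and output results of the above processing, the data and output results may be stored in a blockchain, for example the face image training data, the first feature map, the first geometric relationship matrix, the face picture test data, the second input data, the face key points, and so on.
The present application obtains at least two PDF documents and collects at least one table from each PDF document to obtain a table data set; generates a cross-page table training data set from the table data set; trains a deep learning model with the cross-page table training data set to obtain a table merging model; obtains a PDF test document, removes headers and footers, and builds cross-page table test data; uses the table merging model to predict a binary classification prediction value indicating whether the cross-page table test data need to be merged; determines from that value whether the cross-page table test data need to be merged; and merges and outputs the cross-page tables that need merging. This handles the task of extracting complex tables that span pages in PDF documents effectively and determines with high accuracy whether cross-page tables need to be merged.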
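Embodiment 2
FIG. 2 is a structural diagram of a PDF document cross-page table merging apparatus 30 in an embodiment of the present application.
In some embodiments, the PDF document cross-page table merging apparatus 30 runs in an electronic device. The PDF document cross-page table merging apparatus 30 may include a plurality of functional modules made up of program code segments. The program code of each program segment in the PDF document cross-page table merging apparatus 30 may be stored in a memory and executed by at least one processor to implement the PDF document cross-page table merging function.
In this embodiment, the PDF document cross-page table merging apparatus 30 may be divided into a plurality of functional modules according to the functions it performs. Referring to FIG. 2, the PDF document cross-page table merging apparatus 30 may include a table data acquisition module 301, a training data set construction module 302, a model training module 303, a test data construction module 304, a prediction module 305 and a merging module 306. A module as referred to in the present application is a series of computer-readable instruction segments that can be executed by at least one processor and can perform a fixed function, and that are stored in a memory. The functions of the modules are described in detail below.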
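The table data acquisition module 301 obtains at least two PDF documents containing tables, collects the position information and text information of at least one table in each PDF document, and obtains a table data set from the position information of the tables.
Specifically, in at least one embodiment of the present application, the table data acquisition module 301 collecting the position information and text information of at least one table in each PDF document and obtaining a table data set from the position information of the tables includes:
Parsing each PDF document with the pdfplumber library to obtain the position information and text information of the PDF document, and collecting from the position information the position information of the tables in the PDF document and the position information of each cell in the tables as the table data set.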
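A minimal sketch of this extraction step with pdfplumber is given below. It relies on `pdfplumber.open`, `Page.find_tables`, and the `bbox`, `cells` and `extract()` members of the returned table objects; the record layout, the function names, and the conversion of bounding boxes to `(x, y, width, height)` are assumptions for this example and not part of the described method. pdfplumber is imported lazily so that the helper can be used without the library installed.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # pdfplumber order: (x0, top, x1, bottom)


def bbox_to_xywh(bbox: BBox) -> Tuple[float, float, float, float]:
    """Convert (x0, top, x1, bottom) to (x, y, width, height)."""
    x0, top, x1, bottom = bbox
    return (x0, top, x1 - x0, bottom - top)


def collect_tables(pdf_path: str) -> List[Dict]:
    """Collect table and cell positions plus cell text for every page."""
    import pdfplumber  # third-party library, imported only when needed

    records: List[Dict] = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_no, page in enumerate(pdf.pages):
            for table in page.find_tables():
                records.append({
                    "page": page_no,
                    "table": bbox_to_xywh(table.bbox),
                    "cells": [bbox_to_xywh(c) for c in table.cells],
                    "text": table.extract(),  # rows of cell strings
                })
    return records
```

Calling `collect_tables("report.pdf")` (a hypothetical file name) would return one record per detected table, ready to be turned into the block and cell position information used later.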
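Specifically, in other embodiments of the present application, the table data acquisition module 301 collecting the position information and text information of at least one table in each PDF document and obtaining a table data set from the position information of the tables includes:
Parsing each PDF document with another PDF content parsing library, such as pdfminer or camelot, to obtain the position information and text information of the PDF document, and collecting from the position information the position information and text information of the tables in the PDF document and the position information of each cell in the tables.
In one embodiment, the PDF documents may be documents from different fields that record various kinds of information, for example the financial, commercial and medical fields; the text information is all text information other than pictures, and the position information includes the position information of headers, footers, titles, body text, tables and so on.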
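For each table in the table data set, the training data set construction module 302 randomly selects a row of the table at which to split it, obtaining the position information of the upper block of the table and the position information of the lower block of the table; combines the position information of the upper block of each table with the position information of the lower block of the same table to obtain positive sample data and labels the positive sample data with a first label; and randomly selects the position information of the upper block of each table and the position information of the upper block of another table to obtain negative sample data and labels the negative sample data with a second label. The positive sample data and the negative sample data make up the sample training data, and the sample training data together with the corresponding labels make up the cross-page table training data set.
For example, the first label may be 1 and the second label may be 0.
As another example, when the table data set includes a first table and a second table, the training data set construction module 302 generating the cross-page table training data set from the table data set includes: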
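For the first table, randomly selecting a row other than the first row and the last row at which to split the first table, obtaining the position information of the upper block of the first table and the position information of the lower block of the first table; for the second table, randomly selecting a row other than the first row and the last row at which to split the second table, obtaining the position information of the upper block of the second table and the position information of the lower block of the second table, the upper block and the lower block being the upper part and the lower part of a table after it is split;
Combining the position information of the upper block of the first table with the position information of the lower block of the first table to obtain first positive sample data, combining the position information of the upper block of the second table with the position information of the lower block of the second table to obtain second positive sample data, and labeling the first positive sample data and the second positive sample data 1;
Combining the position information of the upper block of the first table with the position information of the upper block of the second table to obtain first negative sample data, combining the position information of the upper block of the second table with the position information of the upper block of the first table to obtain second negative sample data, and labeling the first negative sample data and the second negative sample data 0;
The first positive sample data, the second positive sample data, the first negative sample data and the second negative sample data make up the sample training data, and the sample training data together with the corresponding labels make up the cross-page table training data set.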
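The sample construction above can be sketched as follows. A table is represented simply as a list of rows; what a row contains (for example cell position tuples) is left open. The names `split_table` and `build_samples` are hypothetical, and treating the selected row as the first row of the lower block is an assumption, since the text does not say on which side of the split the selected row falls.

```python
import random
from typing import List, Sequence, Tuple

Sample = Tuple[list, list, int]  # (first block, second block, label)


def split_table(rows: Sequence, rng: random.Random) -> Tuple[list, list]:
    """Split at a random row that is neither the first nor the last row."""
    if len(rows) < 3:
        raise ValueError("a table needs at least three rows to be split")
    k = rng.randrange(1, len(rows) - 1)  # k is in 1 .. len(rows) - 2
    return list(rows[:k]), list(rows[k:])


def build_samples(tables: Sequence[Sequence], rng: random.Random) -> List[Sample]:
    """Return one positive and one negative sample per table."""
    if len(tables) < 2:
        raise ValueError("negative samples need at least two tables")
    halves = [split_table(t, rng) for t in tables]
    samples: List[Sample] = []
    for i, (upper, lower) in enumerate(halves):
        # Positive: upper and lower block of the same table, label 1.
        samples.append((upper, lower, 1))
        # Negative: upper block paired with another table's upper block, label 0.
        j = rng.choice([k for k in range(len(tables)) if k != i])
        samples.append((upper, halves[j][0], 0))
    return samples
```

With exactly two tables this yields the two positive and two negative samples enumerated in the text; passing a seeded `random.Random` makes the split reproducible.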
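In one embodiment of the present application, the position information of a block includes: the x coordinate of the upper-left corner of the block, the y coordinate of the upper-left corner of the block, the width of the block, the height of the block, the x coordinate of the upper-left corner of a cell, the y coordinate of the upper-left corner of the cell, the width of the cell, the height of the cell, and the number of columns in the block.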
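The model training module 303 constructs a deep learning model based on a pre-trained deep bidirectional transformer model, constructs the input data of the deep learning model from the cross-page table training data set, takes each cell of each table in the cross-page table training data set as one step of the input of the deep learning model, takes the binary classification prediction value corresponding to the label of each piece of sample training data in the cross-page table training data set as the output of the deep learning model, and trains and optimizes the deep learning model to obtain a table merging model.
In at least one embodiment of the present application, constructing the input data of the deep learning model from the cross-page table training data set includes:
The sample training data in the cross-page table training data set and their labels are constructed into data conforming to the model input format and used as the input data of the deep learning model. The model input format is [SEP]+table 1_cell 1+table 1_cell 2+…+table 1_cell m+[SEP]+table 2_cell 1+table 2_cell 2+…+table 2_cell n+[SEP], where table 1 and table 2 denote the two blocks, table_cell denotes a feature built from the position information of a cell in a block, m is the number of cells in table 1, and n is the number of cells in table 2. When m is greater than or equal to n, [SEP] is a sequence of m ones; when m is less than n, [SEP] is a sequence of n ones. The data in each table_cell is [x_t, y_t, w_t, h_t, x_t+w_t, y_t+h_t, (x_t+w_t)/h_t, (y_t+h_t)/2, x_c, y_c, w_c, h_c, x_c+w_c, y_c+h_c, (x_c+w_c)/h_c, (y_c+h_c)/2, a], where x_t is the x coordinate of the upper-left corner of the block, y_t is the y coordinate of the upper-left corner of the block, w_t is the width of the block, h_t is the height of the block, x_c is the x coordinate of the upper-left corner of the cell, y_c is the y coordinate of the upper-left corner of the cell, w_c is the width of the cell, h_c is the height of the cell, and a is 0 or 1.
Specifically, table 1 and table 2 may denote the upper block and the lower block of the same table, or the upper blocks of two different tables. When table 1 and table 2 have the same number of columns, a is 1; when they have different numbers of columns, a is 0.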
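In one embodiment of the present application, taking the binary classification prediction value corresponding to the label of each piece of sample training data in the cross-page table training data set as the output of the deep learning model includes:
When the label of the sample training data is the first label, the binary classification prediction value at [SEP] in the output of the deep learning model is a first preset value;
When the label of the sample training data is the second label, the binary classification prediction value at [SEP] in the output of the deep learning model is a second preset value.
For example, the first preset value may be 1 and the second preset value may be 0.
Specifically, in at least one embodiment of the present application, the binary classification prediction value is the probability that the two blocks in the sample training data belong to the same table. When the binary classification prediction value is the first preset value, the two blocks in the sample training data come from the same table, and the probability that the sample training data need to be merged is 1, that is, the sample training data need to be merged. When the binary classification prediction value is the second preset value, the two blocks in the sample training data come from different tables, and the probability that the sample training data need to be merged is 0, that is, the sample training data do not need to be merged.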
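In at least one embodiment of the present application, training and optimizing the deep learning model to obtain the table merging model includes:
Encoding the input data with the encoding layer of the deep learning model;
Training the prediction layer of the deep learning model until the prediction layer converges, to obtain the table merging model, the output of which is a binary classification prediction value predicting whether the sample training data need to be merged.
Further, in other embodiments of the present application, if the binary classification prediction value is any value between 0 and 1, a preset comparison value of 0.5 may be used to decide whether a given piece of sample training data needs to be merged: when the value is greater than or equal to 0.5, the piece of sample training data needs to be merged; when the value is less than 0.5, it does not.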
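The test data construction module 304 obtains a PDF test document, collects the text information and position information of each page of the PDF test document, removes the text information and position information of the header and footer of each page, determines from the position information of each page whether a table exists at the bottom and at the top of each page, and, when a table exists at the bottom of a page and at the top of the next page, combines the position information of the table at the bottom of the page with the position information of the table at the top of the next page and takes the combined result as cross-page table test data.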
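In at least one embodiment of the present application, removing the text information and position information of the header and footer of each page of the PDF test document includes:
Calculating the mean page height of all pages in the PDF test document;
Taking the height at a first quantile of the mean page height as the candidate region for the header, and taking the height at a second quantile of the mean page height as the candidate region for the footer;
For each page of the PDF test document, extracting the text information and position information in the header candidate region and the text information and position information in the footer candidate region;
For each page of the PDF test document, calculating a first edit distance between the text in the header candidate region of the page and the text in the header candidate regions of a specified number of pages before and after it, and calculating a second edit distance between the text in the footer candidate region of the page and the text in the footer candidate regions of a specified number of pages before and after it;
When the first edit distance is less than a preset first threshold, determining that the text in the candidate region is a header and removing the text information and position information of the header; when the second edit distance is less than a preset second threshold, determining that the text in the candidate region is a footer and removing the text information and position information of the footer.
In at least one embodiment of the present application, the edit distance is a quantitative measure of how different two strings are. Specifically, the edit distance is the minimum number of operations, such as insertions, substitutions and deletions, needed to transform one string into the other.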
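The prediction module 305 uses the table merging model to predict a binary classification prediction value from the cross-page table test data, the binary classification prediction value being used to determine whether the cross-page table test data need to be merged.
In at least one embodiment of the present application, using the table merging model to predict a binary classification prediction value from the cross-page table test data includes:
Converting the cross-page table test data into the format required by the table merging model, and using the converted cross-page table test data as the input data of the table merging model;
The table merging model predicting, from the input data, a binary classification prediction value indicating whether the table at the bottom of the page and the table at the top of the next page in the cross-page table test data need to be merged.
Specifically, in at least one embodiment of the present application, when the binary classification prediction value predicted by the table merging model for the cross-page table test data is greater than or equal to 0.5, the table at the bottom of the page and the table at the top of the next page belong to the same table, so it is determined that the two tables need to be merged. When the binary classification prediction value is less than 0.5, the table at the bottom of the page and the table at the top of the next page belong to different tables, so it is determined that the two tables do not need to be merged.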
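When it is determined that the cross-page table test data need to be merged, the merging module 306 merges the table at the bottom of the page with the table at the top of the next page to obtain a result table, and displays the result table according to an instruction.
In at least one embodiment of the present application, merging the table at the bottom of the page with the table at the top of the next page to obtain a result table, and displaying the result table according to an instruction, includes:
Merging the table at the bottom of the page with the table at the top of the next page according to the extracted position information of the two tables, to obtain the result table;
Storing the result table as a table file, and storing the complete result table;
Displaying the result table when an instruction is received.
The present application obtains at least two PDF documents and collects at least one table from each PDF document to obtain a table data set; generates a cross-page table training data set from the table data set; trains a deep learning model with the cross-page table training data set to obtain a table merging model; obtains a PDF test document, removes headers and footers, and builds cross-page table test data; uses the table merging model to predict a binary classification prediction value indicating whether the cross-page table test data need to be merged; determines from that value whether the cross-page table test data need to be merged; and merges and outputs the cross-page tables that need merging. This handles the task of extracting complex tables that span pages in PDF documents effectively and determines with high accuracy whether cross-page tables need to be merged.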
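Embodiment 3
FIG. 3 is a schematic diagram of an electronic device 6 in an embodiment of the present application.
The electronic device 6 includes a memory 61, a processor 62, and computer-readable instructions stored in the memory 61 and executable on the processor 62. When the processor 62 executes the computer-readable instructions, the steps of the above embodiment of the PDF document cross-page table merging method are implemented, for example steps S11 to S16 shown in FIG. 1. Alternatively, when the processor 62 executes the computer-readable instructions, the functions of the modules/units in the above embodiment of the PDF document cross-page table merging apparatus are implemented, for example modules 301 to 306 in FIG. 2.
Exemplarily, the computer-readable instructions may be divided into one or more modules/units, which are stored in the memory 61 and executed by the processor 62 to carry out the present application. The one or more modules/units may be a series of computer-readable instruction segments capable of performing specific functions, the instruction segments describing the execution of the computer-readable instructions in the electronic device 6. For example, the computer-readable instructions may be divided into the table data acquisition module 301, the training data set construction module 302, the model training module 303, the test data construction module 304, the prediction module 305 and the merging module 306 of FIG. 2; see Embodiment 2 for the specific function of each module.
In this embodiment, the electronic device 6 may be a computing device such as a desktop computer, a notebook, a palmtop computer, a server or a cloud terminal apparatus. Those skilled in the art will understand that the schematic diagram is only an example of the electronic device 6 and does not limit it; the electronic device 6 may include more or fewer components than shown, combine certain components, or use different components. For example, the electronic device 6 may also include input/output devices, network access devices, buses and the like.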
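The processor 62 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor 62 may be any conventional processor. The processor 62 is the control center of the electronic device 6 and connects the parts of the entire electronic device 6 through various interfaces and lines.
The memory 61 may be used to store the computer-readable instructions and/or modules/units. The processor 62 implements the various functions of the electronic device 6 by running or executing the computer-readable instructions and/or modules/units stored in the memory 61 and by calling the data stored in the memory 61. The memory 61 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created through the use of the electronic device 6. In addition, the memory 61 may include volatile memory and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another storage device.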
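If the modules/units integrated in the electronic device 6 are implemented as software functional modules and sold or used as independent products, they may be stored in a computer-readable storage medium. On this understanding, all or part of the processes of the above method embodiments may also be carried out by computer-readable instructions that direct the relevant hardware. The computer-readable instructions may be stored in a computer-readable storage medium and, when executed by a processor, implement the steps of the above method embodiments. The computer-readable instructions include computer-readable instruction code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include any entity or apparatus capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), and so on.
The computer-readable storage medium described in the present application may be non-volatile or volatile.
The blockchain referred to in the present application is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated and linked by cryptographic methods, where each data block contains the information of a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and so on.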
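In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiment described above is only illustrative; the division into modules is only a division by logical function, and other divisions are possible in an actual implementation.
In addition, the functional modules in the embodiments of the present application may be integrated in the same processing module, may each exist physically on their own, or two or more modules may be integrated in the same module. The integrated module may be implemented in hardware, or in hardware together with software functional modules.
For those skilled in the art, it is evident that the present application is not limited to the details of the above exemplary embodiments, and that it can be implemented in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in every respect as exemplary and non-restrictive. The scope of the present application is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalents of the claims are intended to be embraced in the present application. No reference sign in a claim should be construed as limiting the claim concerned. Furthermore, the word "include" clearly does not exclude other modules or steps, and the singular does not exclude the plural. A plurality of modules or electronic devices stated in the present application may also be implemented by one module or electronic device through software or hardware. Words such as "first" and "second" are used as names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present application and not to limit them. Although the present application has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of the present application may be modified or equivalently replaced without departing from their spirit and scope.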

Claims (20)

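  1. A method for merging cross-page tables in a PDF document, wherein the method comprises:
    obtaining at least two PDF documents containing tables, collecting position information and text information of at least one table in each PDF document, and obtaining a table data set from the position information of the tables;
    for each table in the table data set, randomly selecting a row of the table at which to split it, obtaining position information of an upper block of the table and position information of a lower block of the table; combining the position information of the upper block of each table with the position information of the lower block of the same table to obtain positive sample data, and labeling the positive sample data with a first label; randomly selecting the position information of the upper block of each table and the position information of the upper block of another table to obtain negative sample data, and labeling the negative sample data with a second label; the positive sample data and the negative sample data making up sample training data, and the sample training data together with the corresponding labels making up a cross-page table training data set;
    constructing a deep learning model based on a pre-trained deep bidirectional transformer model, constructing input data of the deep learning model from the cross-page table training data set, taking each cell of each table in the cross-page table training data set as one step of the input of the deep learning model, taking the binary classification prediction value corresponding to the label of each piece of sample training data in the cross-page table training data set as the output of the deep learning model, and training and optimizing the deep learning model to obtain a table merging model;
    obtaining a PDF test document, collecting text information and position information of each page of the PDF test document, removing the text information and position information of the header and footer of each page of the PDF test document, determining from the position information of each page whether a table exists at the bottom and at the top of each page, and, when a table exists at the bottom of a page and at the top of the next page, combining the position information of the table at the bottom of the page with the position information of the table at the top of the next page and taking the combined result as cross-page table test data;
    using the table merging model to predict a binary classification prediction value from the cross-page table test data, the binary classification prediction value being used to determine whether the cross-page table test data need to be merged;
    when it is determined that the cross-page table test data need to be merged, merging the table at the bottom of the page with the table at the top of the next page to obtain a result table, and displaying the result table according to an instruction.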
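  2. The method for merging cross-page tables in a PDF document according to claim 1, wherein constructing the input data of the deep learning model from the cross-page table training data set comprises:
    constructing the sample training data in the cross-page table training data set and their labels into data conforming to a model input format and using the data as the input data of the deep learning model, wherein the model input format is [SEP]+table 1_cell 1+table 1_cell 2+…+table 1_cell m+[SEP]+table 2_cell 1+table 2_cell 2+…+table 2_cell n+[SEP], where table 1 and table 2 denote the two blocks, table_cell denotes a feature built from the position information of a cell in a block, m is the number of cells in table 1, and n is the number of cells in table 2; when m is greater than or equal to n, [SEP] is a sequence of m ones, and when m is less than n, [SEP] is a sequence of n ones; the data in each table_cell is [x_t, y_t, w_t, h_t, x_t+w_t, y_t+h_t, (x_t+w_t)/h_t, (y_t+h_t)/2, x_c, y_c, w_c, h_c, x_c+w_c, y_c+h_c, (x_c+w_c)/h_c, (y_c+h_c)/2, a], where x_t is the x coordinate of the upper-left corner of the block, y_t is the y coordinate of the upper-left corner of the block, w_t is the width of the block, h_t is the height of the block, x_c is the x coordinate of the upper-left corner of the cell, y_c is the y coordinate of the upper-left corner of the cell, w_c is the width of the cell, h_c is the height of the cell, and a is 0 or 1.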
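  3. The method for merging cross-page tables in a PDF document according to claim 2, wherein taking the binary classification prediction value corresponding to the label of each piece of sample training data in the cross-page table training data set as the output of the deep learning model comprises:
    when the label of the sample training data is the first label, the binary classification prediction value at [SEP] in the output of the deep learning model being a first preset value;
    when the label of the sample training data is the second label, the binary classification prediction value at [SEP] in the output of the deep learning model being a second preset value.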
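  4. The method for merging cross-page tables in a PDF document according to claim 1, wherein training and optimizing the deep learning model to obtain the table merging model comprises:
    encoding the input data with an encoding layer of the deep learning model;
    training a prediction layer of the deep learning model until the prediction layer of the deep learning model converges, to obtain the table merging model, the output of the table merging model being a binary classification prediction value predicting whether the sample training data need to be merged.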
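  5. The method for merging cross-page tables in a PDF document according to claim 1, wherein removing the text information and position information of the header and footer of each page of the PDF test document comprises:
    calculating the mean page height of all pages in the PDF test document;
    taking the height at a first quantile of the mean page height as a candidate region for the header, and taking the height at a second quantile of the mean page height as a candidate region for the footer;
    for each page of the PDF test document, extracting the text information and position information in the header candidate region and the text information and position information in the footer candidate region;
    for each page of the PDF test document, calculating a first edit distance between the text in the header candidate region of the page and the text in the header candidate regions of a specified number of pages before and after it, and calculating a second edit distance between the text in the footer candidate region of the page and the text in the footer candidate regions of a specified number of pages before and after it;
    when the first edit distance is less than a preset first threshold, determining that the text in the candidate region is a header and removing the text information and position information of the header; and when the second edit distance is less than a preset second threshold, determining that the text in the candidate region is a footer and removing the text information and position information of the footer.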
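  6. The method for merging cross-page tables in a PDF document according to claim 1, wherein using the table merging model to predict a binary classification prediction value from the cross-page table test data comprises:
    converting the cross-page table test data into the format required by the table merging model, and using the converted cross-page table test data as the input data of the table merging model;
    the table merging model predicting, from the input data, a binary classification prediction value indicating whether the table at the bottom of the page and the table at the top of the next page in the cross-page table test data need to be merged.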
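  7. The method for merging cross-page tables in a PDF document according to claim 1, wherein merging the table at the bottom of the page with the table at the top of the next page to obtain a result table, and displaying the result table according to an instruction, comprises:
    merging the table at the bottom of the page with the table at the top of the next page according to the extracted position information of the two tables, to obtain the result table;
    storing the result table as a table file, and storing the complete result table;
    displaying the result table when an instruction is received.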
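  8. An electronic device, wherein the electronic device comprises a memory and a processor, the memory being configured to store at least one computer-readable instruction, and the processor being configured to execute the at least one computer-readable instruction to implement the following steps:
    obtaining at least two PDF documents containing tables, collecting position information and text information of at least one table in each PDF document, and obtaining a table data set from the position information of the tables;
    for each table in the table data set, randomly selecting a row of the table at which to split it, obtaining position information of an upper block of the table and position information of a lower block of the table; combining the position information of the upper block of each table with the position information of the lower block of the same table to obtain positive sample data, and labeling the positive sample data with a first label; randomly selecting the position information of the upper block of each table and the position information of the upper block of another table to obtain negative sample data, and labeling the negative sample data with a second label; the positive sample data and the negative sample data making up sample training data, and the sample training data together with the corresponding labels making up a cross-page table training data set;
    constructing a deep learning model based on a pre-trained deep bidirectional transformer model, constructing input data of the deep learning model from the cross-page table training data set, taking each cell of each table in the cross-page table training data set as one step of the input of the deep learning model, taking the binary classification prediction value corresponding to the label of each piece of sample training data in the cross-page table training data set as the output of the deep learning model, and training and optimizing the deep learning model to obtain a table merging model;
    obtaining a PDF test document, collecting text information and position information of each page of the PDF test document, removing the text information and position information of the header and footer of each page of the PDF test document, determining from the position information of each page whether a table exists at the bottom and at the top of each page, and, when a table exists at the bottom of a page and at the top of the next page, combining the position information of the table at the bottom of the page with the position information of the table at the top of the next page and taking the combined result as cross-page table test data;
    using the table merging model to predict a binary classification prediction value from the cross-page table test data, the binary classification prediction value being used to determine whether the cross-page table test data need to be merged;
    when it is determined that the cross-page table test data need to be merged, merging the table at the bottom of the page with the table at the top of the next page to obtain a result table, and displaying the result table according to an instruction.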
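  9. The electronic device according to claim 8, wherein, when the processor executes the at least one computer-readable instruction to construct the input data of the deep learning model from the cross-page table training data set, this specifically comprises:
    constructing the sample training data in the cross-page table training data set and their labels into data conforming to a model input format and using the data as the input data of the deep learning model, wherein the model input format is [SEP]+table 1_cell 1+table 1_cell 2+…+table 1_cell m+[SEP]+table 2_cell 1+table 2_cell 2+…+table 2_cell n+[SEP], where table 1 and table 2 denote the two blocks, table_cell denotes a feature built from the position information of a cell in a block, m is the number of cells in table 1, and n is the number of cells in table 2; when m is greater than or equal to n, [SEP] is a sequence of m ones, and when m is less than n, [SEP] is a sequence of n ones; the data in each table_cell is [x_t, y_t, w_t, h_t, x_t+w_t, y_t+h_t, (x_t+w_t)/h_t, (y_t+h_t)/2, x_c, y_c, w_c, h_c, x_c+w_c, y_c+h_c, (x_c+w_c)/h_c, (y_c+h_c)/2, a], where x_t is the x coordinate of the upper-left corner of the block, y_t is the y coordinate of the upper-left corner of the block, w_t is the width of the block, h_t is the height of the block, x_c is the x coordinate of the upper-left corner of the cell, y_c is the y coordinate of the upper-left corner of the cell, w_c is the width of the cell, h_c is the height of the cell, and a is 0 or 1.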
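  10. The electronic device according to claim 9, wherein, when the processor executes the at least one computer-readable instruction to take the binary classification prediction value corresponding to the label of each piece of sample training data in the cross-page table training data set as the output of the deep learning model, this specifically comprises:
    when the label of the sample training data is the first label, the binary classification prediction value at [SEP] in the output of the deep learning model being a first preset value;
    when the label of the sample training data is the second label, the binary classification prediction value at [SEP] in the output of the deep learning model being a second preset value.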
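  11. The electronic device according to claim 8, wherein, when the processor executes the at least one computer-readable instruction to train and optimize the deep learning model to obtain the table merging model, this specifically comprises:
    encoding the input data with an encoding layer of the deep learning model;
    training a prediction layer of the deep learning model until the prediction layer of the deep learning model converges, to obtain the table merging model, the output of the table merging model being a binary classification prediction value predicting whether the sample training data need to be merged.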
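  12. The electronic device according to claim 8, wherein, when the processor executes the at least one computer-readable instruction to remove the text information and position information of the header and footer of each page of the PDF test document, this specifically comprises:
    calculating the mean page height of all pages in the PDF test document;
    taking the height at a first quantile of the mean page height as a candidate region for the header, and taking the height at a second quantile of the mean page height as a candidate region for the footer;
    for each page of the PDF test document, extracting the text information and position information in the header candidate region and the text information and position information in the footer candidate region;
    for each page of the PDF test document, calculating a first edit distance between the text in the header candidate region of the page and the text in the header candidate regions of a specified number of pages before and after it, and calculating a second edit distance between the text in the footer candidate region of the page and the text in the footer candidate regions of a specified number of pages before and after it;
    when the first edit distance is less than a preset first threshold, determining that the text in the candidate region is a header and removing the text information and position information of the header; and when the second edit distance is less than a preset second threshold, determining that the text in the candidate region is a footer and removing the text information and position information of the footer.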
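  13. The electronic device according to claim 8, wherein, when the processor executes the at least one computer-readable instruction to use the table merging model to predict a binary classification prediction value from the cross-page table test data, this specifically comprises:
    converting the cross-page table test data into the format required by the table merging model, and using the converted cross-page table test data as the input data of the table merging model;
    the table merging model predicting, from the input data, a binary classification prediction value indicating whether the table at the bottom of the page and the table at the top of the next page in the cross-page table test data need to be merged.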
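  14. A computer-readable storage medium, wherein the computer-readable storage medium stores at least one computer-readable instruction which, when executed by a processor, implements the following steps:
    obtaining at least two PDF documents containing tables, collecting position information and text information of at least one table in each PDF document, and obtaining a table data set from the position information of the tables;
    for each table in the table data set, randomly selecting a row of the table at which to split it, obtaining position information of an upper block of the table and position information of a lower block of the table; combining the position information of the upper block of each table with the position information of the lower block of the same table to obtain positive sample data, and labeling the positive sample data with a first label; randomly selecting the position information of the upper block of each table and the position information of the upper block of another table to obtain negative sample data, and labeling the negative sample data with a second label; the positive sample data and the negative sample data making up sample training data, and the sample training data together with the corresponding labels making up a cross-page table training data set;
    constructing a deep learning model based on a pre-trained deep bidirectional transformer model, constructing input data of the deep learning model from the cross-page table training data set, taking each cell of each table in the cross-page table training data set as one step of the input of the deep learning model, taking the binary classification prediction value corresponding to the label of each piece of sample training data in the cross-page table training data set as the output of the deep learning model, and training and optimizing the deep learning model to obtain a table merging model;
    obtaining a PDF test document, collecting text information and position information of each page of the PDF test document, removing the text information and position information of the header and footer of each page of the PDF test document, determining from the position information of each page whether a table exists at the bottom and at the top of each page, and, when a table exists at the bottom of a page and at the top of the next page, combining the position information of the table at the bottom of the page with the position information of the table at the top of the next page and taking the combined result as cross-page table test data;
    using the table merging model to predict a binary classification prediction value from the cross-page table test data, the binary classification prediction value being used to determine whether the cross-page table test data need to be merged;
    when it is determined that the cross-page table test data need to be merged, merging the table at the bottom of the page with the table at the top of the next page to obtain a result table, and displaying the result table according to an instruction.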
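  15. The storage medium according to claim 14, wherein, when the at least one computer-readable instruction is executed by the processor to construct the input data of the deep learning model from the cross-page table training data set, this specifically comprises:
    constructing the sample training data in the cross-page table training data set and their labels into data conforming to a model input format and using the data as the input data of the deep learning model, wherein the model input format is [SEP]+table 1_cell 1+table 1_cell 2+…+table 1_cell m+[SEP]+table 2_cell 1+table 2_cell 2+…+table 2_cell n+[SEP], where table 1 and table 2 denote the two blocks, table_cell denotes a feature built from the position information of a cell in a block, m is the number of cells in table 1, and n is the number of cells in table 2; when m is greater than or equal to n, [SEP] is a sequence of m ones, and when m is less than n, [SEP] is a sequence of n ones; the data in each table_cell is [x_t, y_t, w_t, h_t, x_t+w_t, y_t+h_t, (x_t+w_t)/h_t, (y_t+h_t)/2, x_c, y_c, w_c, h_c, x_c+w_c, y_c+h_c, (x_c+w_c)/h_c, (y_c+h_c)/2, a], where x_t is the x coordinate of the upper-left corner of the block, y_t is the y coordinate of the upper-left corner of the block, w_t is the width of the block, h_t is the height of the block, x_c is the x coordinate of the upper-left corner of the cell, y_c is the y coordinate of the upper-left corner of the cell, w_c is the width of the cell, h_c is the height of the cell, and a is 0 or 1.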
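  16. The storage medium according to claim 15, wherein, when the at least one computer-readable instruction is executed by the processor to take the binary classification prediction value corresponding to the label of each piece of sample training data in the cross-page table training data set as the output of the deep learning model, this specifically comprises:
    when the label of the sample training data is the first label, the binary classification prediction value at [SEP] in the output of the deep learning model being a first preset value;
    when the label of the sample training data is the second label, the binary classification prediction value at [SEP] in the output of the deep learning model being a second preset value.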
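  17. The storage medium according to claim 14, wherein, when the at least one computer-readable instruction is executed by the processor to train and optimize the deep learning model to obtain the table merging model, this specifically comprises:
    encoding the input data with an encoding layer of the deep learning model;
    training a prediction layer of the deep learning model until the prediction layer of the deep learning model converges, to obtain the table merging model, the output of the table merging model being a binary classification prediction value predicting whether the sample training data need to be merged.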
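  18. The storage medium according to claim 14, wherein, when the at least one computer-readable instruction is executed by the processor to remove the text information and position information of the header and footer of each page of the PDF test document, this specifically comprises:
    calculating the mean page height of all pages in the PDF test document;
    taking the height at a first quantile of the mean page height as a candidate region for the header, and taking the height at a second quantile of the mean page height as a candidate region for the footer;
    for each page of the PDF test document, extracting the text information and position information in the header candidate region and the text information and position information in the footer candidate region;
    for each page of the PDF test document, calculating a first edit distance between the text in the header candidate region of the page and the text in the header candidate regions of a specified number of pages before and after it, and calculating a second edit distance between the text in the footer candidate region of the page and the text in the footer candidate regions of a specified number of pages before and after it;
    when the first edit distance is less than a preset first threshold, determining that the text in the candidate region is a header and removing the text information and position information of the header; and when the second edit distance is less than a preset second threshold, determining that the text in the candidate region is a footer and removing the text information and position information of the footer.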
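  19. The storage medium according to claim 14, wherein, when the at least one computer-readable instruction is executed by the processor to use the table merging model to predict a binary classification prediction value from the cross-page table test data, this specifically comprises:
    converting the cross-page table test data into the format required by the table merging model, and using the converted cross-page table test data as the input data of the table merging model;
    the table merging model predicting, from the input data, a binary classification prediction value indicating whether the table at the bottom of the page and the table at the top of the next page in the cross-page table test data need to be merged.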
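  20. An apparatus for merging cross-page tables in a PDF document, wherein the apparatus comprises:
    a table data acquisition module, configured to obtain at least two PDF documents containing tables, collect position information and text information of at least one table in each PDF document, and obtain a table data set from the position information of the tables;
    a training data set construction module, configured to, for each table in the table data set, randomly select a row of the table at which to split it, obtaining position information of an upper block of the table and position information of a lower block of the table; combine the position information of the upper block of each table with the position information of the lower block of the same table to obtain positive sample data, and label the positive sample data with a first label; randomly select the position information of the upper block of each table and the position information of the upper block of another table to obtain negative sample data, and label the negative sample data with a second label; the positive sample data and the negative sample data making up sample training data, and the sample training data together with the corresponding labels making up a cross-page table training data set;
    a model training module, configured to construct a deep learning model based on a pre-trained deep bidirectional transformer model, construct input data of the deep learning model from the cross-page table training data set, take each cell of each table in the cross-page table training data set as one step of the input of the deep learning model, take the binary classification prediction value corresponding to the label of each piece of sample training data in the cross-page table training data set as the output of the deep learning model, and train and optimize the deep learning model to obtain a table merging model;
    a test data construction module, configured to obtain a PDF test document, collect text information and position information of each page of the PDF test document, remove the text information and position information of the header and footer of each page of the PDF test document, determine from the position information of each page whether a table exists at the bottom and at the top of each page, and, when a table exists at the bottom of a page and at the top of the next page, combine the position information of the table at the bottom of the page with the position information of the table at the top of the next page and take the combined result as cross-page table test data;
    a prediction module, configured to use the table merging model to predict a binary classification prediction value from the cross-page table test data, the binary classification prediction value being used to determine whether the cross-page table test data need to be merged;
    a merging module, configured to, when it is determined that the cross-page table test data need to be merged, merge the table at the bottom of the page with the table at the top of the next page to obtain a result table, and display the result table according to an instruction.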
PCT/CN2021/096636 2020-11-17 2021-05-28 Pdf文档跨页表格合并方法、装置、电子设备及存储介质 WO2022105172A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011290521.7 2020-11-17
CN202011290521.7A CN112380825B (zh) 2020-11-17 2020-11-17 Pdf文档跨页表格合并方法、装置、电子设备及存储介质

Publications (1)

Publication Number Publication Date
WO2022105172A1 true WO2022105172A1 (zh) 2022-05-27

Family

ID=74585013

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/096636 WO2022105172A1 (zh) 2020-11-17 2021-05-28 Pdf文档跨页表格合并方法、装置、电子设备及存储介质

Country Status (2)

Country Link
CN (1) CN112380825B (zh)
WO (1) WO2022105172A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117496545A (zh) * 2024-01-02 2024-02-02 物产中大数字科技有限公司 一种面向pdf文档的表格数据融合处理方法及装置

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112380825B (zh) * 2020-11-17 2022-07-15 平安科技(深圳)有限公司 Pdf文档跨页表格合并方法、装置、电子设备及存储介质
CN113362026A (zh) * 2021-06-04 2021-09-07 北京金山数字娱乐科技有限公司 文本处理方法及装置
CN113761833A (zh) * 2021-08-16 2021-12-07 联想(北京)有限公司 一种文档内容的显示方法、装置及设备
CN115344718B (zh) * 2022-07-13 2023-06-13 北京庖丁科技有限公司 跨区域文档内容识别方法、装置、设备、介质和程序产品

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090276693A1 (en) * 2008-05-02 2009-11-05 Canon Kabushiki Kaisha Document processing apparatus and document processing method
CN107818075A (zh) * 2017-10-16 2018-03-20 平安科技(深圳)有限公司 表格信息结构化提取方法、电子设备及计算机可读存储介质
CN107844468A (zh) * 2017-10-16 2018-03-27 平安科技(深圳)有限公司 表格信息跨页识别方法、电子设备及计算机可读存储介质
CN109635268A (zh) * 2018-12-29 2019-04-16 南京吾道知信信息技术有限公司 Pdf文件中表格信息的提取方法
CN111027297A (zh) * 2019-12-23 2020-04-17 海南港澳资讯产业股份有限公司 一种对图像型pdf财务数据关键表格信息的处理方法
CN112380825A (zh) * 2020-11-17 2021-02-19 平安科技(深圳)有限公司 Pdf文档跨页表格合并方法、装置、电子设备及存储介质

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9430453B1 (en) * 2012-12-19 2016-08-30 Emc Corporation Multi-page document recognition in document capture
US20200234003A1 (en) * 2017-02-27 2020-07-23 Alex Bakman Method, system and apparatus for generating, editing, and deploying native mobile apps and utilizing deep learning for instant digital conversion
CN110348294B (zh) * 2019-05-30 2024-04-16 平安科技(深圳)有限公司 Pdf文档中图表的定位方法、装置及计算机设备

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117496545A (zh) * 2024-01-02 2024-02-02 物产中大数字科技有限公司 一种面向pdf文档的表格数据融合处理方法及装置
CN117496545B (zh) * 2024-01-02 2024-03-15 物产中大数字科技有限公司 一种面向pdf文档的表格数据融合处理方法及装置

Also Published As

Publication number Publication date
CN112380825B (zh) 2022-07-15
CN112380825A (zh) 2021-02-19

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21893327

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21893327

Country of ref document: EP

Kind code of ref document: A1