CN115759020A - Form information extraction method, form template configuration method and electronic equipment - Google Patents


Info

Publication number
CN115759020A
Authority
CN
China
Prior art keywords
document
boundary
page
determining
row
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211435725.4A
Other languages
Chinese (zh)
Inventor
王健
袁野
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Hongji Information Technology Co Ltd
Original Assignee
Shanghai Hongji Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Hongji Information Technology Co Ltd filed Critical Shanghai Hongji Information Technology Co Ltd
Priority to CN202211435725.4A priority Critical patent/CN115759020A/en
Publication of CN115759020A publication Critical patent/CN115759020A/en
Pending legal-status Critical Current

Abstract

The application provides a form information extraction method, a form template configuration method, and an electronic device. The method includes: acquiring a target form template corresponding to a document to be processed, where the document to be processed contains a table in a target format; determining boundary markers of the table in the target format according to the target form template; and determining the target table content in the document to be processed according to the boundary markers.

Description

Form information extraction method, form template configuration method and electronic equipment
Technical Field
The present application relates to the field of information extraction technologies, and in particular, to a form information extraction method, a form template configuration method, and an electronic device.
Background
An important task in document processing is the extraction of tables from documents. For a framed table, table information can be extracted by using a table recognition algorithm to recognize the table and then recognizing the characters within it. However, not all tables have complete frames: some tables are frameless, and their entries do not each sit inside a solid-line grid cell, which makes information extraction for such tables difficult.
Disclosure of Invention
The application aims to provide a form information extraction method, a form template configuration method, and an electronic device, so as to reduce the difficulty of extracting information from frameless tables.
In a first aspect, the present invention provides a form information extraction method, including: acquiring a target form template corresponding to a document to be processed, where the document to be processed contains a table in a target format; determining boundary markers of the table in the target format according to the target form template; and determining the target table content in the document to be processed according to the boundary markers.
In the method provided by the embodiments of the application, by configuring a form template, the boundary markers of the table in the target format can be obtained from the template, so that the table in the target format contained in the document to be processed can be located based on those markers. Recognition and extraction are therefore no longer limited to framed tables, broadening the application scenarios of table information extraction.
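The three steps above can be sketched in a few lines. This is a minimal illustration only; the names (`get_target_template`, `extract_table`, the template fields) are hypothetical and not taken from the patent, and keyword boundary markers stand in for the fuller marker types described later.

```python
# Hypothetical sketch of the three-step flow: look up a template for the
# document type, read its boundary markers, keep the lines between them.

def get_target_template(doc_type, templates):
    """Step 1: fetch the form template configured for this document type."""
    return templates[doc_type]

def extract_table(document_lines, template):
    """Steps 2-3: read the boundary markers from the template, then keep
    only the lines between the upper and lower boundary markers."""
    start_marker = template["upper_boundary"]
    end_marker = template["lower_boundary"]
    start = next(i for i, ln in enumerate(document_lines) if start_marker in ln)
    end = next(i for i, ln in enumerate(document_lines) if end_marker in ln)
    return document_lines[start + 1:end]

templates = {"invoice": {"upper_boundary": "Item list", "lower_boundary": "Total"}}
doc = ["Invoice #42", "Item list", "a1  b1  c1", "a2  b2  c2", "Total: 99"]
tpl = get_target_template("invoice", templates)
print(extract_table(doc, tpl))  # → ['a1  b1  c1', 'a2  b2  c2']
```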
In an alternative embodiment, the boundary marker comprises: document boundary markers and in-page boundary markers;
determining the target table content in the document to be processed according to the boundary markers includes: determining a table coverage area in the document to be processed according to the document boundary markers; extracting a table area from each page in the table coverage area according to the in-page boundary markers; and determining the target table content according to the table area of each page.
In the above embodiment, by setting two types of boundary markers, the region covered by the table in the document can first be extracted using the document boundary markers, and the effective table region in each page can then be screened out based on the in-page boundary markers, so that the table content is selected more accurately.
In an alternative embodiment, the document boundary marker includes: a document upper boundary marker and a document lower boundary marker;
determining a table coverage area in the document to be processed according to the document boundary markers includes: determining a table start position in the document to be processed according to the document upper boundary marker; and determining a table end position in the document to be processed according to the document lower boundary marker, where the table coverage area is formed between the table start position and the table end position.
In an alternative embodiment, the document boundary marker includes: marking the upper boundary of the document;
determining a table coverage area in the document to be processed according to the document boundary marker includes: determining a table start position in the document to be processed according to the document upper boundary marker; and taking the end of the document to be processed as the table end position, where the table coverage area is formed between the table start position and the table end position.
In the above embodiments, the table coverage area can be located through the combined effect of the document upper and lower boundary markers. Further, when no document lower boundary marker is configured, the end of the document can serve as the table end position, which adapts better to changes in the document and improves the flexibility of extracting table information.
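The fallback to the end of the document can be sketched as follows. This is an illustrative simplification (line-based, keyword markers); the function name and signature are assumptions, not the patent's implementation.

```python
def table_coverage(lines, upper_marker, lower_marker=None):
    """Return the slice between the upper boundary marker and either the
    lower boundary marker or, if none is configured, the end of the document."""
    start = next(i for i, ln in enumerate(lines) if upper_marker in ln) + 1
    if lower_marker is None:          # no lower marker configured:
        return lines[start:]          # the table runs to the end of the document
    end = next(i for i, ln in enumerate(lines) if lower_marker in ln)
    return lines[start:end]

doc = ["Header", "BEGIN TABLE", "row 1", "row 2", "END TABLE", "Footer"]
print(table_coverage(doc, "BEGIN TABLE", "END TABLE"))  # → ['row 1', 'row 2']
print(table_coverage(doc, "BEGIN TABLE"))  # → ['row 1', 'row 2', 'END TABLE', 'Footer']
```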
In an optional embodiment, the determining, according to the document boundary marker, a table coverage area in the document to be processed includes:
performing text recognition on the document to be processed to determine a text object set contained in the document to be processed;
and screening the text object set of the document to be processed according to the document boundary marks so as to determine a form coverage area in the document to be processed.
In an alternative embodiment, the document boundary markers include: any one group of document boundary keywords, document boundary regular expressions, document boundary keywords and relative offsets, and document boundary regular expressions and relative offsets;
screening the text object set of the document to be processed according to the document boundary markers to determine a table coverage area in the document to be processed includes:
screening out the document boundary keywords from the text object set of the document to be processed, determining the document boundaries of the document to be processed according to the positions of the document boundary keywords, and taking the text objects between the document boundaries as the table coverage area of the document to be processed; or,
screening out a first position of the document boundary keyword in the text object set of the document to be processed, determining, with the document boundary keyword as a reference, a second position at the relative offset from the document boundary keyword in the text object set so as to determine the document boundaries of the document to be processed, and taking the text objects between the document boundaries as the table coverage area of the document to be processed; or,
screening out the document boundary regular expression from the text object set of the document to be processed, determining the document boundaries of the document to be processed according to the position of the document boundary regular expression, and taking the text objects between the document boundaries as the table coverage area of the document to be processed; or,
screening out a third position of the document boundary regular expression in the text object set of the document to be processed, determining, with the document boundary regular expression as a reference, a fourth position at the relative offset from the document boundary regular expression in the text object set so as to determine the document boundaries of the document to be processed, and taking the text objects between the document boundaries as the table coverage area of the document to be processed.
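The keyword, regular-expression, and relative-offset variants above all reduce to one lookup pattern, sketched below. The function name is an assumption; real text objects would carry bounding boxes rather than bare strings.

```python
import re

def find_boundary(text_objects, keyword=None, pattern=None, offset=0):
    """Locate a document boundary in a list of recognized text objects.
    The boundary is given as a literal keyword or a regular expression;
    an optional relative offset (in objects) shifts the boundary from
    the matched position, as in the keyword-plus-offset variant above."""
    for i, obj in enumerate(text_objects):
        matched = (keyword is not None and keyword in obj) or \
                  (pattern is not None and re.search(pattern, obj))
        if matched:
            return i + offset
    raise ValueError("boundary marker not found")

objs = ["Shipping manifest", "Page 1 of 3", "ITEM  QTY",
        "bolt  10", "nut  24", "Signature:"]
top = find_boundary(objs, pattern=r"^ITEM\s+QTY")   # regex boundary
bottom = find_boundary(objs, keyword="Signature:")  # keyword boundary
print(objs[top + 1:bottom])  # → ['bolt  10', 'nut  24']
```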
In an optional embodiment, the table coverage area spans M pages of content of the document to be processed, and the table coverage area includes M subsets of text objects;
the in-page boundary markers include: keywords representing the in-page boundaries or regular expressions representing the in-page boundaries;
the extracting a table area in each page in the table coverage area according to the intra-page boundary mark comprises:
aiming at the mth text object subset, comparing the keywords of the page inner boundaries with the text character strings in the mth text object subset to screen out the page inner boundaries in the mth text object subset, wherein the text objects between the page inner boundaries are used as the table area of the mth page; alternatively, the first and second electrodes may be,
and aiming at the mth text object subset, matching the regular expression of the inner page boundary with the text character strings in the mth text object subset to screen out the inner page boundary in the mth text object subset, wherein the text objects between the inner page boundaries are used as a table area of the mth page, and M is a positive integer which is greater than or equal to 1 and less than or equal to M.
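The per-page screening can be sketched as below, assuming each page's text objects are simple strings; the function name and the marker strings are illustrative.

```python
import re

def page_table_region(page_lines, upper, lower, use_regex=False):
    """Keep only the lines of one page that sit between the in-page upper
    and lower boundary markers (keywords or regular expressions)."""
    match = (lambda pat, s: re.search(pat, s)) if use_regex \
            else (lambda pat, s: pat in s)
    top = next(i for i, ln in enumerate(page_lines) if match(upper, ln))
    bottom = next(i for i, ln in enumerate(page_lines) if match(lower, ln))
    return page_lines[top + 1:bottom]

pages = [
    ["Report p.1", "-- table --", "a1 b1", "a2 b2", "-- end --", "footer 1/2"],
    ["Report p.2", "-- table --", "a3 b3", "-- end --", "footer 2/2"],
]
# screen out the effective table region of each of the M pages
regions = [page_table_region(p, "-- table --", "-- end --") for p in pages]
print(regions)  # → [['a1 b1', 'a2 b2'], ['a3 b3']]
```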
In an alternative embodiment, the text object includes a text string and bounding box coordinates of the text string;
the determining the target table content according to the table area of each page comprises:
determining the bounding box coordinates of the table area of each page according to the bounding box coordinates of the text objects contained in the table area of each page;
and sequentially splicing the table areas of the pages into the target table content according to their bounding box coordinates.
In an optional embodiment, sequentially splicing the table areas of the pages into the target table content according to their bounding box coordinates includes:
for an ith page table area and an (i + 1)th page table area, scaling the bounding box coordinates of the (i + 1)th page table area so that the width of the scaled (i + 1)th page table area is the same as the width of the ith page table area;
splicing the ith page table area with the scaled (i + 1)th page table area;
where i runs sequentially from 1 to M - 1, so that the M page table areas are spliced in order into the target table content.
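The scale-then-stack step can be sketched with plain coordinate tuples. This is a simplified model (it scales each later page to the first page's width and stacks pages vertically); the function names are assumptions.

```python
def bbox_union(boxes):
    """Smallest (x0, y0, x1, y1) box enclosing all given boxes."""
    xs0, ys0, xs1, ys1 = zip(*boxes)
    return min(xs0), min(ys0), max(xs1), max(ys1)

def splice_pages(page_boxes):
    """Scale each later page's table boxes so its overall width matches the
    first page's table, then stack the pages vertically (y grows downward)."""
    result = list(page_boxes[0])
    ref_x0, _, ref_x1, ref_y1 = bbox_union(page_boxes[0])
    ref_w = ref_x1 - ref_x0
    y_cursor = ref_y1                      # bottom edge of what is stacked so far
    for boxes in page_boxes[1:]:
        x0, y0, x1, y1 = bbox_union(boxes)
        scale = ref_w / (x1 - x0)          # width ratio to the reference page
        for bx0, by0, bx1, by1 in boxes:
            result.append((ref_x0 + (bx0 - x0) * scale,
                           y_cursor + (by0 - y0),
                           ref_x0 + (bx1 - x0) * scale,
                           y_cursor + (by1 - y0)))
        y_cursor += y1 - y0
    return result

page1 = [(0, 0, 100, 10), (0, 10, 100, 20)]  # two text boxes on page 1
page2 = [(0, 0, 200, 10)]                    # page 2 table is twice as wide
spliced = splice_pages([page1, page2])
print(spliced[2])  # → (0.0, 20, 100.0, 30)
```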
In an alternative embodiment, the in-page boundary markers include: an in-page upper boundary marker and an in-page lower boundary marker;
extracting a table area from each page in the table coverage area according to the in-page boundary markers includes: for each page in the table coverage area, determining an in-page table upper boundary according to the in-page upper boundary marker; and determining an in-page table lower boundary according to the in-page lower boundary marker, where the table area is formed between the in-page table upper boundary and the in-page table lower boundary.
In the above embodiment, non-table areas can be eliminated by the in-page upper and lower boundary markers, so as to better locate the effective table area in each page.
In an alternative embodiment, the boundary markers include: column offsets and a row reference column;
determining the target table content according to the table area of each page includes: splicing the table areas of the pages to obtain initial table data; dividing the initial table data into multiple columns of table column data according to the column offsets; determining multiple rows of table row data in the initial table data according to the row reference column, where the row reference column is one column of the table in the target format; and determining the target table content according to the multiple columns of table column data and the multiple rows of table row data.
In the above embodiment, the column offsets and the row reference column make it possible to recognize the content of each implicit cell even in a table without grid lines, and the target table content determined from the multiple columns of table column data and multiple rows of table row data presents the distribution of the table data, the table header, and so on more accurately.
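Cutting the initial table data into columns at configured offsets can be sketched as below. For simplicity the offsets are character positions in text lines; in practice they would be pixel x-offsets against bounding boxes. The function name is illustrative.

```python
def split_columns(rows, column_offsets):
    """Cut each text row of a frameless table into cells at the configured
    column offsets (character positions here; pixel x-offsets in practice)."""
    cells = []
    for row in rows:
        bounds = list(column_offsets) + [len(row)]   # right edge of last column
        cells.append([row[bounds[i]:bounds[i + 1]].strip()
                      for i in range(len(bounds) - 1)])
    return cells

rows = ["alpha    10   ok",
        "beta      7   fail"]
print(split_columns(rows, [0, 9, 14]))
# → [['alpha', '10', 'ok'], ['beta', '7', 'fail']]
```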
In an optional embodiment, the determining, according to the row reference column, a plurality of rows of table row data in the initial table data includes:
if the content storage mode of the row reference column is top-aligned, determining the content between the line where the nth object of the row reference column is located in the initial table data and the line preceding the line where the (n + 1)th object is located as the nth row of table data, where n is greater than or equal to 1 and less than or equal to N - 1, and N is the number of objects contained in the row reference column.
In an optional embodiment, the determining, according to the row reference column, a plurality of rows of table row data in the initial table data includes:
if the content storage mode of the row reference column is bottom-aligned, determining the content between the upper boundary of the initial table data and the line where the first object is located as the first row of table data;
determining the content between the line following the line where the nth object of the row reference column is located in the initial table data and the line where the (n + 1)th object is located as the (n + 1)th row of table data, where n is greater than or equal to 1 and less than or equal to N - 1, and N is the number of objects contained in the row reference column.
In an optional embodiment, the determining, according to the row reference column, a plurality of rows of table row data in the initial table data includes:
if the content storage mode of the row reference column is center-aligned, determining a first offset between the upper boundary of the initial table data and the first object of the row reference column;
determining, with the first object as a reference, the lower boundary of the first row of table data as the position offset from the first object by the first offset;
determining the content between the upper boundary of the initial table data and the lower boundary of the first row of table data as the first row of table data;
determining an (n + 1)th offset between the lower boundary of the nth row of table data and the (n + 1)th object of the row reference column;
determining, with the (n + 1)th object as a reference, the lower boundary of the (n + 1)th row of table data as the position offset from the (n + 1)th object by the (n + 1)th offset;
and determining the content between the lower boundary of the nth row of table data and the lower boundary of the (n + 1)th row of table data as the (n + 1)th row of table data, where n is greater than or equal to 1 and less than or equal to N - 1, and N is the number of objects contained in the row reference column.
In these embodiments, the invisible rows of tables in different formats can be identified, so that accurate table content is delimited.
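The three alignment modes above can be sketched with line indices. This is a simplified, line-based model of the described logic (the center-aligned case places each row's lower edge as far below the reference entry as its upper edge is above it); the function name and index conventions are assumptions.

```python
def row_spans(ref_lines, total_lines, alignment):
    """Split a frameless table into (start, end) line spans using the 0-based
    line indices at which the row-reference column's entries appear."""
    spans = []
    if alignment == "top":
        # row n runs from entry n down to just before entry n+1
        for k, start in enumerate(ref_lines):
            end = ref_lines[k + 1] if k + 1 < len(ref_lines) else total_lines
            spans.append((start, end))
    elif alignment == "bottom":
        # row n runs from just after entry n-1 down to entry n
        prev = 0
        for end in ref_lines:
            spans.append((prev, end + 1))
            prev = end + 1
    elif alignment == "center":
        # row n's lower edge sits as far below entry n as its upper edge
        # sits above it
        upper = 0
        for ln in ref_lines:
            lower = ln + (ln - upper) + 1
            spans.append((upper, min(lower, total_lines)))
            upper = lower
    return spans

print(row_spans([0, 3, 5], 7, "top"))     # → [(0, 3), (3, 5), (5, 7)]
print(row_spans([1, 4, 6], 7, "bottom"))  # → [(0, 2), (2, 5), (5, 7)]
print(row_spans([1, 4], 6, "center"))     # → [(0, 3), (3, 6)]
```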
In an optional embodiment, determining the target table content according to the multiple columns of table column data and the multiple rows of table row data includes: determining an output table according to the number of columns of table column data and the number of rows of table row data; and filling the output table with the character content of the multiple columns of table column data and multiple rows of table row data to obtain the target table content.
In the above embodiment, the extracted information can be output in a table format, so that the user can conveniently view the table content.
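The fill step can be sketched as below: size the output table from the recognized column and row counts, then fill each cell from the column/row intersection. The function name and column-major input layout are assumptions for illustration.

```python
def build_output_table(col_data, row_count):
    """Assemble the output table: its width is the number of recognized
    columns and its height the number of recognized rows; each cell is
    filled from the corresponding column/row intersection."""
    table = [["" for _ in col_data] for _ in range(row_count)]
    for c, column in enumerate(col_data):
        for r in range(min(row_count, len(column))):
            table[r][c] = column[r]
    return table

# column-major data recognized from a frameless table
cols = [["alpha", "beta"], ["10", "7"], ["ok", "fail"]]
print(build_output_table(cols, 2))
# → [['alpha', '10', 'ok'], ['beta', '7', 'fail']]
```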
In an optional embodiment, the determining the target table content according to the multiple columns of table column data and the multiple rows of table row data includes:
determining whether a nested structure exists in the table line data of each row;
if the table line data has the nested structure, determining nested structure offset and nested keys according to the target table template;
if the table line data has the nested structure, determining a nested structure area in the jth row of table line data according to the nested structure offset aiming at the jth row of table line data;
in the nesting structure area, determining nesting keys contained in the jth row of table line data and key values corresponding to the nesting keys;
and determining the content of a target table according to the nested key and the key value corresponding to the nested key, the multi-column table column data and the multi-row table row data.
In the above embodiment, when nested content exists in the extracted table row data, it can also be extracted and the layout of the table updated, making the target table content more readable.
In an optional embodiment, the determining, according to the nested key and the key value, the multiple columns of table column data, and the multiple rows of table row data corresponding to the nested key, the target table content includes: constructing a new header according to the nested key and the original header of the target format; and filling the new header with the key value corresponding to the nested key and the table column data to obtain the target table content.
In this embodiment, the extracted nested keys are added as part of a new header, so that the determined target table content is more intuitive and the user can conveniently grasp the distribution of information in the table.
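The nested-key flattening above can be sketched as follows. This is a simplified model: a slice over each row stands in for the configured nested-structure offset, `key: value` strings stand in for the nested structure, and the function name is an assumption.

```python
def expand_nested(rows, header, nested_region):
    """Pull 'key: value' pairs out of each row's nested region, append the
    nested keys to a new header, and emit flat rows (missing keys -> '')."""
    nested_keys = []
    parsed = []
    for row in rows:
        pairs = {}
        for item in row[nested_region]:
            key, _, value = item.partition(":")
            key, value = key.strip(), value.strip()
            pairs[key] = value
            if key not in nested_keys:
                nested_keys.append(key)
        parsed.append((row[:nested_region.start], pairs))
    new_header = header + nested_keys          # original header + nested keys
    flat_rows = [base + [pairs.get(k, "") for k in nested_keys]
                 for base, pairs in parsed]
    return new_header, flat_rows

header = ["A", "B"]
rows = [["a1", "b1", "key1: v1", "key2: v2"],
        ["a2", "b2", "key1: v3"]]
result = expand_nested(rows, header, slice(2, None))
print(result)
# → (['A', 'B', 'key1', 'key2'], [['a1', 'b1', 'v1', 'v2'], ['a2', 'b2', 'v3', '']])
```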
In a second aspect, the present invention provides a form template configuration method, including: obtaining a table sample document, wherein the table sample document comprises a table in a target format; performing text recognition on the table sample document to obtain a text object set; and receiving a selection operation of the boundary marker of the text object set to determine a target form template, wherein the target form template is used for form information extraction in the form information extraction method according to the foregoing embodiment.
In the above embodiment, the target form template is determined by selection operations on the boundary markers, enabling information extraction from different types of tables.
In an alternative embodiment, the target form template comprises: a document upper boundary marker, a document lower boundary marker, a page inner upper boundary marker, a page inner lower boundary marker, a column offset, and a row reference column;
the receiving a selection operation of a boundary marker of the text object set to determine a target form template includes:
displaying the text object set on a configuration operation interface, wherein the configuration operation interface comprises a document boundary definition area, an in-page boundary definition area, a column definition area and a row definition area;
generating the document upper boundary marker and the document lower boundary marker in the document boundary definition area based on the selection operation of the text object set;
generating an in-page upper boundary marker and an in-page lower boundary marker in an in-page boundary definition area based on a selection operation on the text object set;
determining the column offset based on a selected column dividing line in the text object set in the column definition area;
in the row definition area, the row reference column is determined based on a selection operation on the text object set.
In an alternative embodiment, the target form template further comprises: nested structural offset and nested key;
the configuration operation interface comprises: nesting the definition area;
the receiving a selection operation of a boundary marker of the text object set to determine a target form template further includes:
in the nesting definition area, determining a nesting reference position and the nesting structure offset based on the selection operation of the text object set;
and in the nesting definition area, determining the nesting key based on the selection operation in the nesting structure in the text object set.
In the above embodiment, a dedicated operation area can be provided for each type of boundary marker, so that the configuration of the form template is more comprehensive and table information can be extracted more accurately based on the template.
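One way to hold what the configuration interface collects is a small record type; the field names below are illustrative, not taken from the patent, and an absent lower boundary marker models the end-of-document fallback described earlier.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FormTemplate:
    """Hypothetical container for the markers selected on the
    configuration interface (document boundaries, in-page boundaries,
    column offsets, row reference column, nesting settings)."""
    doc_upper: str                          # document upper boundary marker
    doc_lower: Optional[str] = None         # None -> table runs to end of document
    page_upper: Optional[str] = None        # in-page upper boundary marker
    page_lower: Optional[str] = None        # in-page lower boundary marker
    column_offsets: List[int] = field(default_factory=list)
    row_reference_column: int = 0           # index of the row reference column
    nested_offset: Optional[int] = None     # nested structure offset, if any
    nested_keys: List[str] = field(default_factory=list)

tpl = FormTemplate(doc_upper="Item list", doc_lower="Total",
                   page_upper="-- table --", page_lower="-- end --",
                   column_offsets=[0, 9, 14], row_reference_column=0)
print(tpl.column_offsets)  # → [0, 9, 14]
```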
In a third aspect, the present invention provides a form information extraction apparatus comprising: the device comprises a first acquisition module, a second acquisition module and a processing module, wherein the first acquisition module is used for acquiring a target form template corresponding to a document to be processed, and the document to be processed comprises a form in a target format; the mark determining module is used for determining the boundary mark of the table in the target format according to the target table template; and the table determining module is used for determining the target table content in the document to be processed according to the boundary marks.
In a fourth aspect, the present invention provides a form template configuration apparatus, including: the second acquisition module is used for acquiring a table sample document, wherein the table sample document comprises a table in a target format; the text recognition module is used for performing text recognition on the table sample document to obtain a text object set; an operation receiving module, configured to receive a selection operation on a boundary marker of the text object set to determine a target table template, where the target table template is used for table information extraction in the table information extraction method according to the foregoing embodiment.
In a fifth aspect, the present invention provides an electronic device, comprising: a processor, a memory storing machine-readable instructions executable by the processor, the machine-readable instructions being executable by the processor to perform the steps of the method according to any one of the preceding embodiments when the electronic device is running.
In a sixth aspect, the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method according to any of the preceding embodiments.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be considered as limiting its scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.
Fig. 1 is a block diagram of an electronic device according to an embodiment of the present disclosure;
fig. 2 is a flowchart of a table information extraction method provided in an embodiment of the present application;
fig. 3 is a schematic functional block diagram of a table information extraction apparatus according to an embodiment of the present application;
FIG. 4 is a flowchart of a form template configuration method provided in an embodiment of the present application;
fig. 5 is a functional block diagram of a form template configuration apparatus according to an embodiment of the present application.
Detailed Description
The technical solution in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
In current document processing, in addition to recognizing the characters in a document, the information of any tables it contains is generally extracted as well. Tables in documents mainly fall into framed tables and frameless tables. For a framed table (as shown in Table 1 below), a table recognition algorithm can first recognize the table structure and then recognize the characters inside it, extracting the complete table structure and content. As shown in Table 1 below, the table includes table metadata (the header in the first row) and table data (the rows of actual values under the header). Since every value falls into its own ruled grid cell, the complete table structure and content can be extracted by recognizing the table and the characters within it.
TABLE 1
However, for a frameless table, since there is no complete grid, even if the table data is recognized it is hard to associate it with the header, and the divisions between data items are hard to recover directly. As shown in Table 2, the table contains no complete grid, only a header and a dividing line between the header and the table data.
TABLE 2
Since each value in the table shown in Table 2 does not fall into a complete ruled cell, when the table is recognized, the values a1, b1, c1, and d1 may not be separated into entries, and the recognized result may not be well associated with the header.
As shown in Table 3, the table contains no complete grid, only a header and a dividing line between the header and the table data, and some other content is nested inside the table:
TABLE 3
Using existing table recognition algorithms, it may be difficult to recognize nested information such as key1: key11, key2: key21, and key3: key31, and the nested content cannot be associated with the header information: Header A, Header B, Header C, Header D.
Based on the above current situation, the form information extraction method, the form template configuration method and the electronic device provided by the application can not only extract information of framed forms, but also extract information of some relatively complex frameless forms.
The form information extraction method and the form template configuration method provided by the application can be used in Robotic Process Automation (RPA) technology. RPA can simulate the keyboard and mouse operations that staff perform on a computer in daily work, and can replace humans in operations such as logging into systems, operating software, reading and writing data, downloading files, and reading mail. As a virtual workforce for an enterprise, such automated robots free staff from repetitive, low-value work so they can focus on high-value work, helping the enterprise reduce costs and increase efficiency while undergoing digital and intelligent transformation.
RPA is a software robot that replaces manual tasks in business processes and interacts with a computer's front-end systems the way a human does. It can therefore be regarded as a software program robot running on a personal PC or a server, which imitates the operations a user performs on a computer and automatically repeats activities such as mail retrieval, attachment downloading, system login, and data processing and analysis, quickly, accurately, and reliably. Like a traditional physical robot, it follows specific rules to address the speed and accuracy problems of human work; but whereas a traditional physical robot combines software and hardware and can only work with software running on specific hardware, an RPA robot is pure software and can be deployed to any PC or server to complete its assigned work as long as the corresponding software is installed.
That is, RPA is a way of performing business operations with "digital staff" instead of people, together with its related technology. In essence, RPA uses software automation to imitate human operation of systems, software, web pages, documents, and other objects on a computer, acquiring business information, executing business actions, and ultimately automating processes, saving labor costs, and improving processing efficiency. As this description suggests, to implement RPA the target content to be operated on must first be found in the document or on the screen before operations can be performed on it automatically. Searching a form document for content based on an input keyword is therefore one of the techniques of interest for realizing RPA.
In implementing the table information extraction method provided by the application, Optical Character Recognition (OCR) technology can be used. OCR refers to translating the shapes of characters on paper into computer text using character recognition methods: scanning the text, then analyzing and processing the image file to obtain the character and layout information.
For facilitating understanding of the embodiments of the present application, first, an electronic device executing the form information extraction method and the form template configuration method disclosed in the embodiments of the present application will be described in detail.
As shown in fig. 1, is a block schematic diagram of an electronic device. The electronic device 100 may include a memory 111, a memory controller 112, a processor 113, a peripheral interface 114, an input-output unit 115, and a display unit 116. It will be understood by those of ordinary skill in the art that the structure shown in fig. 1 is merely exemplary and is not intended to limit the structure of the electronic device 100. For example, electronic device 100 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 111, memory controller 112, processor 113, peripheral interface 114, input/output unit 115 and display unit 116 are electrically connected to one another, directly or indirectly, to enable data transmission and interaction. For example, these components may be electrically connected via one or more communication buses or signal lines. The processor 113 is configured to execute the executable modules stored in the memory.
The memory 111 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The memory 111 is used for storing a program, which the processor 113 executes after receiving an execution instruction; the method executed by the electronic device 100 as defined by the flow disclosed in any embodiment of the present application may be applied to, or implemented by, the processor 113.
The processor 113 may be an integrated circuit chip having signal processing capability. The processor 113 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and it may implement or execute the methods, steps and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The peripheral interface 114 couples various input/output devices to the processor 113 and memory 111. In some embodiments, the peripheral interface 114, the processor 113, and the memory controller 112 may be implemented in a single chip. In other examples, they may be implemented separately from each other.
The input/output unit 115 is used to allow a user to provide input data. The input/output unit 115 may be, but is not limited to, a mouse, a keyboard, and the like.
The display unit 116 provides an interactive interface (e.g., a user operation interface) between the electronic device 100 and the user or is used for displaying image data to the user for reference. In this embodiment, the display unit may be a liquid crystal display or a touch display. In the case of a touch display, the display can be a capacitive touch screen or a resistive touch screen, which supports single-point and multi-point touch operations. The support of single-point and multi-point touch operations means that the touch display can sense touch operations simultaneously generated from one or more positions on the touch display, and the sensed touch operations are sent to the processor for calculation and processing.
In this embodiment, the memory 111 of the electronic device 100 stores a computer program implementing an OCR algorithm for recognizing documents, and the processor 113 can extract the text information in a document by invoking the OCR algorithm.
When the electronic device 100 in this embodiment is used to execute the steps in the form template configuration method, an operation interface for configuring the form template may be displayed in the display unit 116 of the electronic device 100.
The electronic device 100 in this embodiment may be configured to perform each step in each method provided in this embodiment. The following describes in detail the implementation process of the form information extraction method and the form template configuration method by using several embodiments.
Please refer to fig. 2, which is a flowchart of a table information extraction method according to an embodiment of the present disclosure. The specific flow shown in fig. 2 will be described in detail below.
Step 210, obtaining a target form template corresponding to the document to be processed.
Wherein, the document to be processed comprises a table in a target format. Illustratively, the table in the document to be processed may be a table across multiple pages.
Form templates for tables in multiple formats may be stored in a database in advance. When table information needs to be extracted from a document, a matching form template can be selected from the database, and the selected template is then used to extract the table information from the document.
Illustratively, each form template may be named by, or described by, some key information. When the form templates need to be screened, document key information can be extracted from the document to be processed and compared with each form template to screen out the target form template.
The key information used to describe the form template may be information specific to the file to which the form template belongs, such as the company to which the form template belongs, the document category, the file-specific format, and so forth. The key information may be represented by a keyword or a regular expression.
Step 220, according to the target form template, determining the boundary marks of the form in the target format.
The target form template may have the boundary marks of the forms in the target format recorded therein, and the boundary marks may be directly read from the target form template.
The boundary markers may be used to represent the boundaries of a table in a target format in a document.
The boundary marker may be represented by a keyword, a regular expression, a keyword and a relative offset, or the regular expression and the relative offset. Of course, the boundary marks may be different according to the actual table.
Step 230, determining the target table content in the document to be processed according to the boundary mark.
The boundary marks are compared with the information in the document to be processed to determine the boundary of the table in the target format in the document to be processed, so as to screen out the target table content.
Optionally, text information extraction may first be performed on the document to be processed to obtain a text object set of the document to be processed; the boundary marks are then compared with each object in the text object set to determine the area occupied by the table in the target format within the text object set, so as to obtain the target table content.
Optionally, the regions of the tables in the target format in the text object set may be spliced into a complete table as output.
The target table content includes the content of the table in the target format in the document to be processed, and may be output in the form of a framed table.
By configuring form templates, the boundary marks of a table in a target format can be extracted from the template, so that the table in the target format contained in a document to be processed can be located based on those boundary marks. Table recognition and extraction are thus no longer limited to framed tables, which broadens the application scenarios of table information extraction.
In some cases a table may not be presented continuously; for example, content that does not belong to the table may be inserted in the middle of it. Some explanatory text may be inserted at the beginning of each page of the document to be processed, and the end of each page may contain non-table content such as page numbers and web addresses. To avoid identifying such non-table information as table content, the boundary marks may include document boundary marks and in-page boundary marks. For example, when the table in the document to be processed spans multiple pages, the document boundary mark may be a cross-page document boundary mark. Step 230 may include step 231 and step 232.
Step 231, determining a table coverage area in the document to be processed according to the document boundary mark.
Optionally, text recognition may be performed on the document to be processed to identify the set of text objects it contains. The document to be processed may be recognized by an OCR algorithm, for example, or parsed by a PDF interpretation module to determine the text objects it contains. The set of text objects contained in the document to be processed may be denoted text object set S1.
Illustratively, each text object may include at least two parts of data: a text string, and the bounding box coordinates of that string. The bounding box coordinates may include the coordinates of the four corners of the text string; for example, the bounding box coordinates of a text string may be (x0, y0), (x1, y1), (x2, y2) and (x3, y3), where (x0, y0) is the upper-left corner, (x1, y1) the lower-left corner, (x2, y2) the upper-right corner and (x3, y3) the lower-right corner, and the four coordinates together form a rectangular box. In one example, the mean of the four coordinates, ((x0 + x1 + x2 + x3)/4, (y0 + y1 + y2 + y3)/4), is the center coordinate of the text string.
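As a minimal sketch of this data layout (the `TextObject` name and corner-tuple ordering are illustrative, not part of the application):

```python
from dataclasses import dataclass

@dataclass
class TextObject:
    """A recognized text string plus the four corners of its bounding box.

    Corner order follows the description above: (x0, y0) upper-left,
    (x1, y1) lower-left, (x2, y2) upper-right, (x3, y3) lower-right.
    """
    text: str
    corners: tuple  # ((x0, y0), (x1, y1), (x2, y2), (x3, y3))

    def center(self):
        """Mean of the four corner coordinates, as described above."""
        xs = [p[0] for p in self.corners]
        ys = [p[1] for p in self.corners]
        return (sum(xs) / 4, sum(ys) / 4)

obj = TextObject("total", ((0, 0), (0, 10), (40, 0), (40, 10)))
# center is ((0 + 0 + 40 + 40)/4, (0 + 10 + 0 + 10)/4) = (20.0, 5.0)
```

The center coordinate is convenient later for deciding which row or column region a string falls into.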
The text object set of the document to be processed may be screened according to the document boundary markers to determine a table coverage area, where the table coverage area may include part or all of the text objects of the text object set of the document to be processed. For example, the set of text objects corresponding to the screened table coverage area may be represented as text object set S2.
A document boundary mark may be represented by a document boundary keyword, a document boundary regular expression, a document boundary keyword plus a relative offset, or a document boundary regular expression plus a relative offset.
Illustratively, if the document boundary marker is represented by a document boundary keyword, the document boundary may be determined in the set of text objects S1 by finding the document boundary keyword in the set of text objects S1. The text objects between the document boundaries can then be used as a set of text objects S2 corresponding to the coverage area of the table.
For example, if the document boundary mark is represented by a keyword and a relative offset, the document boundary keyword is first located in the text object set S1; the document boundary is then determined as the position offset from the keyword by the relative offset, and the text objects between the document boundaries are taken as the text object set S2 corresponding to the table coverage area.
For example, the keyword may be "keyword11" and the relative offset may be 2 lines of text offset down. The keyword "keyword11" may be first found in the text object set S1 and then the position of the keyword "keyword11" shifted downward by two lines is determined as the document boundary.
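A minimal sketch of this keyword-plus-offset lookup, assuming the recognized text objects have already been grouped into lines (the function and variable names are illustrative):

```python
def find_boundary_line(lines, keyword, line_offset):
    """Locate the first line containing `keyword`, then shift the boundary
    down by `line_offset` lines (clamped to the last line of the document)."""
    for i, line in enumerate(lines):
        if keyword in line:
            return min(i + line_offset, len(lines) - 1)
    return None  # keyword absent: no boundary found

lines = ["title", "keyword11 appears here", "note", "row 1", "row 2"]
# "keyword11" sits on line 1; shifting down two lines marks line 3 as boundary
boundary = find_boundary_line(lines, "keyword11", 2)
```

A regular-expression variant would simply substitute `re.search` for the substring test.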
For example, if the document boundary mark is represented by a regular expression, the regular expression may be compared with the format of the text strings in each text object in the text object set S1 to determine the document boundary, and the set of text objects before the boundary may be taken as the text object set S2 corresponding to the table coverage area.
For example, if the document boundary mark is represented by a regular expression and a relative offset, the regular expression is first matched against the format of the text strings in S1 to find its position; the position at a distance of the relative offset from that match is then taken as the document boundary, and the set of text objects before the boundary is taken as the text object set S2 corresponding to the table coverage area.
In step 232, the table area in each page is extracted from the table coverage area according to the in-page boundary marks.
Taking the text object set S2 corresponding to the table coverage area as an example, if the table coverage area is located in M pages of contents in the document to be processed, the text object set S2 may include M text object subsets. Table extraction may be performed for each subset of textual objects with intra-page boundary markers to obtain a table area in each page. Wherein M is a positive integer greater than or equal to 1.
In this embodiment, the in-page boundary flag may be represented by an in-page boundary keyword, or may be represented by an in-page boundary regular expression.
For the m-th text object subset, if the in-page boundary mark is represented by an in-page boundary keyword, the keyword is compared with the text strings in the m-th text object subset to screen out the in-page boundaries; the text objects between the in-page boundaries are taken as the table area of the m-th page, which may be denoted text object set S6_m.
If the in-page boundary mark is represented by an in-page boundary regular expression, the regular expression is matched against the text strings in the m-th text object subset to screen out the in-page boundaries, and the text objects between them are taken as the table area of the m-th page.
Here m is a positive integer greater than or equal to 1 and less than or equal to M.
The text object sets S6_x of the individual pages may then be merged to obtain the text object set S6 of the table area of the document to be processed.
Alternatively, the in-page boundary marks of the respective pages may be the same or different.
Illustratively, a corresponding in-page boundary mark may be set for each page: the marks for the pages may all differ, or only some may differ. For example, dedicated in-page boundary marks may be set for the first page and the M-th page, with the remaining pages sharing the same mark.
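Assuming each page's recognized text has been grouped into lines, the in-page filtering described above might be sketched as follows (the function name and the line-based representation are illustrative, and either boundary mark may be omitted, as in the later embodiments):

```python
import re

def page_table_region(page_lines, top_pattern=None, bottom_pattern=None):
    """Keep the lines between the in-page upper and lower boundary marks.

    `top_pattern` / `bottom_pattern` are regular expressions; when one is
    None, the top (or bottom) of the page is used as the boundary.
    """
    start, end = 0, len(page_lines)
    for i, line in enumerate(page_lines):
        if top_pattern and re.search(top_pattern, line):
            start = i + 1  # table content begins below the upper mark
        if bottom_pattern and re.search(bottom_pattern, line) and i >= start:
            end = i        # table content ends above the lower mark
            break
    return page_lines[start:end]

page = ["Report header", "col A col B", "1 2", "page 3 of 7"]
region = page_table_region(page, r"header", r"page \d+ of \d+")
# region keeps only the two table lines between the marks
```

Running the function once per page and concatenating the results corresponds to merging the per-page sets S6_x into S6.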
Step 233, determine the target table content according to the table area of each page.
Alternatively, the table areas of each page may be directly spliced, and the target table content may be obtained.
Alternatively, the position of the set of text objects S6_ x for each page may be adjusted based on the bounding box coordinates of the text strings of each object in the set of text objects S6_ x for the table region for each page to splice out the target table content.
For example, the bounding box coordinates of each page table region may be determined according to the bounding box coordinates of each text string of the text object set S6_ x of each page table region, and the bounding box coordinates of each page table region may be sequentially spliced to obtain the target table content.
Consider an example in which the table in the target format in the document to be processed spans the two pages below. The bounding box coordinates of the table bodies on the first and second pages are as follows:
first page: [lt(x1, y1), rt(x2, y1), ld(x1, y2), rd(x2, y2)];
second page: [lt(x3, y3), rt(x4, y3), ld(x3, y4), rd(x4, y4)];
where lt(x1, y1) is the upper-left corner of the first page's table body, rt(x2, y1) its upper-right corner, ld(x1, y2) its lower-left corner and rd(x2, y2) its lower-right corner; lt(x3, y3), rt(x4, y3), ld(x3, y4) and rd(x4, y4) are the corresponding corners of the second page's table body.
The width of the first-page table is w1 = |x1 - x2| and the width of the second-page table is w2 = |x3 - x4|. To splice the table body of the second page onto that of the first page, the second page may be scaled so that it has the same width as the first page.
The scaling factor of the second page relative to the first page can be expressed as c = w1/w2. The bounding box of the second page after scaling is [lt(x3, y3), rt(x4', y3), ld(x3, y4'), rd(x4', y4')], where |x3 - x4'| = w1. The bounding box coordinates of the text strings in the second-page table body may be scaled in the same way.
Next, a translation is applied so that the table body of the second page is left-aligned with that of the first page and its top is spliced to the bottom of the first page's table body. Continuing the example, the table body of the second page is translated by Δx = x1 - x3 and Δy = y1 - y3; that is, (Δx, Δy) is added to the scaled bounding box coordinates [lt(x3, y3), rt(x4', y3), ld(x3, y4'), rd(x4', y4')] of the second page's table body and to the bounding box coordinates (x, y) of each of its text strings.
The above is merely an example; if the table in the target format covers more pages, the third, fourth and subsequent pages may be processed in turn in the same way, so as to splice the table areas of all pages.
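Following the w1/w2 scaling factor and (Δx, Δy) translation derived above, the splicing parameters can be sketched as follows (representing each table body by its upper-left and upper-right corners is an assumption for brevity):

```python
def splice_page(first_box, second_box):
    """Compute the scale factor and translation that bring the second
    page's table body onto the first page's, per the formulas above."""
    (x1, y1), (x2, _y2) = first_box    # upper-left, upper-right of page 1 body
    (x3, y3), (x4, _y4) = second_box   # upper-left, upper-right of page 2 body
    w1, w2 = abs(x1 - x2), abs(x3 - x4)
    scale = w1 / w2                    # c = w1 / w2
    dx, dy = x1 - x3, y1 - y3          # translation (delta-x, delta-y)
    return scale, dx, dy

# First-page body 100 units wide, second-page body 50 units wide:
scale, dx, dy = splice_page(((10, 20), (110, 20)), ((5, 30), (55, 30)))
# scale = 2.0, dx = 5, dy = -10
```

Every corner and text-string coordinate of the second page would then be scaled by `scale` and shifted by `(dx, dy)` before stacking.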
In different documents, a table occupies the document differently: some documents contain other content before the table and some closing content after it; some documents end with the table, with no content after it; and in others the table is the opening content, followed by some closing content. The document boundary marks may therefore also differ from document to document. The determination of the table coverage area in these cases is described below through several embodiments.
In one embodiment, the document boundary markers may include: upper document boundary markers and lower document boundary markers.
Step 231 may include: determining a table start bit in the document to be processed according to the upper boundary mark of the document; and determining a form termination bit in the document to be processed according to the document lower boundary mark.
A table coverage area is formed between the table start bit and the table termination bit.
The document upper boundary mark may be represented by a keyword, a regular expression, a keyword plus a relative offset, or a regular expression plus a relative offset; the document lower boundary mark may likewise be represented in any of these ways.
Illustratively, if the document upper boundary mark is represented by a keyword, the keyword is compared with each text string in the text object set S1 of the document to be processed to determine the table start bit.
If multiple text strings in S1 match the document upper boundary keyword, the position of the text string that appears earliest in the document to be processed may be taken as the table start bit.
If the document upper boundary mark is represented by a keyword and a relative offset, the keyword is compared with each text string in S1 to determine its position; the position at a distance of the relative offset from the keyword is then taken as the table start bit.
Illustratively, if the document lower boundary mark is represented by a keyword, the keyword is compared with each text string in S1 to determine the table termination bit.
If multiple text strings in S1 match the document lower boundary keyword, the position of the text string that appears last in the document to be processed may be taken as the table termination bit.
Illustratively, if the document upper boundary mark is represented by a regular expression, the regular expression is compared with the format of each text string in S1 to determine the table start bit.
If multiple text strings in S1 have a format matching the document upper boundary regular expression, the position of the text string that appears earliest may be taken as the table start bit.
Illustratively, if the document lower boundary mark is represented by a regular expression, the regular expression is compared with the format of each text string in S1 to determine the table termination bit.
If multiple text strings in S1 have a format matching the document lower boundary regular expression, the position of the text string that appears last may be taken as the table termination bit.
In one embodiment, the document boundary markers may include only upper document boundary markers. Step 231 may include: determining a form start bit in the document to be processed according to the upper boundary mark of the document; and determining the end position of the document to be processed as a form end position.
In one embodiment, the document boundary marks may include only a document lower boundary mark.
Step 231 may include: determining the starting position of the document to be processed as a table starting position; and determining a form termination bit in the document to be processed according to the document lower boundary mark.
In this way, the document boundary marks adapt the identification of the table coverage area to different documents.
In different documents, the table may be presented differently on each page; for example, each page may begin with a page header, or the bottom of each page may include information such as a web address or an associated code. The in-page boundary marks may therefore also differ between documents. The determination of the table area in these cases is described below through several embodiments.
In one embodiment, the in-page boundary markers include: an in-page upper boundary marker and an in-page lower boundary marker. Step 232 may include: for each page in the table coverage area, determining an upper table boundary in the page according to an upper boundary mark in the page; and determining the lower boundary of the table in the page according to the lower boundary mark in the page.
A table area is formed between the in-page table upper boundary and the in-page table lower boundary of the page.
In this embodiment, the boundary mark in the page may be represented by a keyword, a regular expression, a keyword and a relative offset, or a regular expression and a relative offset.
In one embodiment, the in-page boundary marks include only an in-page upper boundary mark. Step 232 may include: for each page in the table coverage area, determining the in-page table upper boundary according to the in-page upper boundary mark, and taking the bottom of the page as the in-page table lower boundary.
In one embodiment, the in-page boundary marks include only an in-page lower boundary mark. Step 232 may include: for each page in the table coverage area, taking the top of the page as the in-page table upper boundary, and determining the in-page table lower boundary according to the in-page lower boundary mark.
In this way, the in-page boundary marks adapt the identification of the table areas to different documents.
To make each key value in the table more intuitive, the boundary marks may further include a column offset and a row reference column. Step 233 may include steps 2331 to 2334.
Step 2331, obtaining initial table data based on the table area of each page.
The manner of splicing the table areas of each page may be similar to the manner of splicing the table areas of two pages described in the above step 233, and reference may be specifically made to the description of the manner of splicing in the above step 233.
Step 2332, dividing the initial table data into multiple columns of table column data according to the column offsets.
The number of column offsets in the boundary mark may be one less than the number of header columns in the table of the target format in the document to be processed. Taking Table 2 above as an example, the table has four columns (header A, header B, header C and header D), so the four columns of table column data can be determined from three column offsets.
Illustratively, each column offset may be represented by an abscissa or by a distance from the abscissa of the upper left-hand corner of the bounding box of the header.
The area from the abscissa of the upper left corner of the bounding box of the header of the initial tabular data to the abscissa of the first column offset is the first column tabular column data; the area between the abscissa of the first column offset and the abscissa of the second column offset is the second tabular column data; the area between the abscissa of the ith column offset to the abscissa of the (i + 1) th column offset is the (i + 1) th tabular column data.
Taking Table 2 as an example, the three column offsets may be c1, c2 and c3, and the abscissa of the upper-left corner of the header's bounding box may be denoted x0.
The region of the initial table data with abscissa from x0 to x0 + c1 is then the first column of table column data; the region from x0 + c1 to x0 + c2 is the second column; the region from x0 + c2 to x0 + c3 is the third column; and the region from x0 + c3 to the rightmost edge of the initial table data is the fourth column.
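The partition into column x-ranges can be sketched as follows (names are illustrative; the offsets are taken as distances from the header's upper-left abscissa, per the description above):

```python
def split_columns(x0, column_offsets, right_edge):
    """Turn the column offsets c1..ck into k+1 half-open [left, right)
    x-ranges: x0 .. x0+c1 .. x0+ck .. right edge of the table."""
    edges = [x0] + [x0 + c for c in column_offsets] + [right_edge]
    return list(zip(edges[:-1], edges[1:]))

# Header upper-left at x0 = 100, offsets c1, c2, c3 = 50, 120, 200,
# rightmost edge of the initial table data at x = 400:
ranges = split_columns(100, [50, 120, 200], 400)
# ranges -> [(100, 150), (150, 220), (220, 300), (300, 400)]
```

Each text object would then be assigned to the column whose x-range contains its center abscissa.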
Step 2333, determining multiple rows of table row data in the initial table data according to the row reference column.
Wherein the row reference column is one of the columns in the table of the target format.
In one embodiment, if the content of the row reference column is stored top-aligned, the content between the line containing the n-th object of the row reference column in the initial table data and the line just above the (n+1)-th object is determined as the n-th row of table row data, where n is greater than or equal to 1 and less than or equal to N-1, and N is the number of objects contained in the row reference column.
In one embodiment, if the content of the row reference column is stored bottom-aligned, the content between the upper boundary of the initial table data and the line containing the first object is determined as the first row of table row data; the content between the line just below the n-th object of the row reference column and the line containing the (n+1)-th object is determined as the (n+1)-th row of table row data, where n is greater than or equal to 1 and less than or equal to N-1, and N is the number of objects contained in the row reference column.
In one embodiment, if the content of the row reference column is stored center-aligned: a first offset between the upper boundary of the initial table data and the first object of the row reference column is determined; the lower boundary of the first row of table row data is placed at the first offset below the first object, and the region between the upper boundary of the initial table data and that lower boundary is the first row of table row data. Then the (n+1)-th offset between the lower boundary of the n-th row and the (n+1)-th object of the row reference column is determined; the lower boundary of the (n+1)-th row is placed at the (n+1)-th offset below the (n+1)-th object, and the region between the lower boundary of the n-th row and the lower boundary of the (n+1)-th row is the (n+1)-th row of table row data, where n is greater than or equal to 1 and less than or equal to N-1, and N is the number of objects contained in the row reference column.
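For the top-aligned case, the row partition can be sketched as follows (names are illustrative; `reference_ys` holds the top ordinate of each object in the row reference column, and closing the last row at the table bottom is an assumption the embodiment leaves open):

```python
def split_rows_top_aligned(reference_ys, table_bottom):
    """Row boundaries for a top-aligned row reference column: row n spans
    from the line of its n-th object down to just above the (n+1)-th
    object; the final row runs to the bottom of the table."""
    bounds = list(reference_ys) + [table_bottom]
    return [(bounds[n], bounds[n + 1]) for n in range(len(reference_ys))]

# Three objects in the row reference column at y = 10, 40, 90,
# table bottom at y = 130:
rows = split_rows_top_aligned([10, 40, 90], 130)
# rows -> [(10, 40), (40, 90), (90, 130)]
```

The bottom-aligned and center-aligned variants differ only in how each boundary ordinate is derived from the reference objects.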
Illustratively, the midpoint position of the first object may be referenced.
Step 2334, determining the target table content according to the multiple columns of table column data and the multiple rows of table row data.
For example, the target table content may be obtained by directly assembling the multiple columns of table column data and the multiple rows of table row data into a table.
Optionally, an output table may be determined according to the number of the columns of the table and the rows of the table; and filling the output table according to the character contents in the multi-column table column data and the multi-row table row data to obtain the target table content.
For example, if I columns of table column data and J rows of table row data are determined, a table with J+1 rows and I columns can be constructed.
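A sketch of assembling the (J+1)-row by I-column output table and filling it with recognized text (the `cells` mapping is an illustrative stand-in for the text objects assigned to each row/column region):

```python
def build_output_table(column_ranges, row_ranges, cells):
    """Build a (J+1) x I table: one header row plus J data rows.

    `cells` maps (row, column) indices to recognized text; positions with
    no assigned text are filled with an empty string.
    """
    rows = len(row_ranges) + 1   # J data rows plus one header row
    cols = len(column_ranges)    # I columns
    return [[cells.get((r, c), "") for c in range(cols)] for r in range(rows)]

table = build_output_table(
    [(0, 50), (50, 100)],              # two column x-ranges (I = 2)
    [(10, 40)],                        # one data-row y-range (J = 1)
    {(0, 0): "header A", (0, 1): "header B", (1, 0): "v1"},
)
# table -> [["header A", "header B"], ["v1", ""]]
```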
By the mode, the initial table data can be divided into rows and columns, so that the determined target table content is more visual and has stronger readability.
For some tables with nested structures, as shown in Table 3, the nested structures can also be extracted to construct a new table. Step 2334 may include: determining whether a nested structure exists in each row of table row data; if a nested structure exists, determining the nested structure offset and the nested keys according to the target form template; for the j-th row of table row data, determining the nested structure area in that row according to the nested structure offset; determining, within the nested structure area, the nested keys contained in the j-th row and the key values corresponding to them; and determining the target table content according to the nested keys and their key values, the multiple columns of table column data and the multiple rows of table row data.
Wherein J is greater than or equal to 1 and less than or equal to J. J denotes the number of rows of the determined table line data.
For example, it may be determined, for each row of table row data, whether there is content that falls outside the configured columns; if so, it may be determined that a nested structure exists in that row of table row data.
Illustratively, a portion of any row of table row data may span multiple lines. For example, as shown in table 3, the first row of table row data includes content not belonging to Header A, Header B, Header C, or Header D: KEY1: key11, KEY2: key21, KEY3: key31. It may then be determined that a nested structure exists in that row of table row data.
The nested structure offset may be expressed as the distance from the upper boundary of the row of table row data to the nested structure region. Taking the jth row of table row data as an example, if the ordinate of the upper boundary of the body of the jth row is yj and the nested structure offset is yd, the region between the ordinate yj + yd and the upper boundary y(j+1) of the next row of table row data may be determined as the nested structure region.
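That region computation is a one-liner; the function below is an illustrative sketch using the yj / yd notation above.

```python
def nested_region(row_top, next_row_top, nested_offset):
    """Return the (top, bottom) ordinates of the nested structure region:
    from yj + yd down to the next row's upper boundary y(j+1)."""
    top = row_top + nested_offset
    if not top < next_row_top:
        raise ValueError("offset leaves no room for a nested region")
    return (top, next_row_top)
```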
Optionally, a designated symbol is located between each nested key and its corresponding key value, and the nested key and its key value can be determined by recognizing the designated symbol.
Taking table 3 as an example, the designated symbol may be ":". If KEY1 and key11 are located on the two sides of one ":", it can be determined that key11 is the key value corresponding to the nested key KEY1.
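Parsing the designated-symbol pairs out of the nested region can be sketched with a regular expression; the function name is an assumption, and the sketch assumes keys and values contain no whitespace, as in the table 3 example.

```python
import re

def parse_nested_pairs(text, symbol=":"):
    """Extract {key: value} pairs where key and value sit on either side of
    the designated symbol, e.g. 'KEY1: key11 KEY2: key21'."""
    pattern = r"(\S+)\s*" + re.escape(symbol) + r"\s*(\S+)"
    return dict(re.findall(pattern, text))
```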
Optionally, the key value corresponding to each nested key may be determined according to a key offset. Illustratively, each nested key corresponds to a key offset; after a nested key is located, the content at the key-offset distance from the nested key is taken as the key value corresponding to that nested key.
Determining the target table content according to the nested key and the key value corresponding to the nested key, the multiple columns of table column data and the multiple rows of table row data may include: constructing a new header according to the nested key and the original header of the target format; and filling the new header with the key value corresponding to the nesting key and the table column data to obtain the target table content.
Taking table 3 as an example, the nested keys may include KEY1, KEY2, and KEY3; the new header constructed may then include: Header A, Header B, Header C, Header D, KEY1, KEY2, and KEY3. The columns KEY1, KEY2, and KEY3 are then filled with the corresponding key values, so that table 4 below is obtained after table 3 is processed.
Header A  Header B  Header C  Header D  KEY1   KEY2   KEY3
a1        b1        c1        d1        key11  key21  key31
a2        b2        c2        d2        key12  key22  key32
TABLE 4
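The table 3 to table 4 transformation above can be sketched as follows. The row representation (a dict per row plus a "nested" dict of parsed key/value pairs) is an assumed intermediate form, not one prescribed by the patent.

```python
def expand_nested(headers, rows, nested_keys):
    """Append the nested keys to the original header and fill their columns
    with the corresponding key values, yielding the table 4 layout."""
    new_headers = list(headers) + list(nested_keys)
    new_rows = []
    for row in rows:
        cells = [row[h] for h in headers]
        # missing nested keys are filled with an empty cell
        cells += [row["nested"].get(k, "") for k in nested_keys]
        new_rows.append(cells)
    return new_headers, new_rows
```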
In this way, information can be extracted from different borderless tables, and bordered tables that can be read intuitively are output.
Based on the same application concept, a table information extraction device corresponding to the table information extraction method is also provided in the embodiments of the present application, and because the principle of solving the problem of the device in the embodiments of the present application is similar to that in the embodiments of the table information extraction method, the implementation of the device in the embodiments of the present application can refer to the description in the embodiments of the method, and the repeated parts are not described again.
Please refer to fig. 3, which is a schematic diagram of a functional module of a table information extraction apparatus according to an embodiment of the present application. Each module in the table information extraction device in this embodiment is configured to execute each step in the above method embodiments. The table information extraction device includes: a first obtaining module 310, a mark determining module 320, and a table determining module 330; the contents of each module are as follows:
a first obtaining module 310, configured to obtain a target form template corresponding to a document to be processed, where the document to be processed includes a form in a target format;
a mark determining module 320, configured to determine a boundary mark of the table in the target format according to the target table template;
the table determining module 330 is configured to determine a target table content in the document to be processed according to the boundary marker.
Please refer to fig. 4, which is a flowchart of a form template configuration method according to an embodiment of the present disclosure. The specific flow shown in fig. 4 will be described in detail below.
At step 410, a form sample document is obtained.
Wherein the table sample document includes a table in a target format.
Illustratively, the tabular sample document may be a representative file.
In step 420, text recognition is performed on the form sample document to obtain a text object set.
Optionally, the table sample document may be subjected to text recognition by an OCR algorithm to extract the character strings and the coordinates of the respective characters in the table sample document.
Illustratively, each text object in the set of text objects may include: a character string and coordinates of the character string. For example, the coordinates of the character string may be represented by bounding box coordinates or may be represented by single-point coordinates.
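A text object of this kind can be sketched as a small value type; the class name, field names, and the bounding-box convention (x0, y0, x1, y1) are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class TextObject:
    """A recognized character string plus its bounding-box coordinates."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def center(self):
        # single-point representation derived from the bounding box
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)
```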
Step 430, receiving a selection operation of the boundary marker of the text object set to determine the target form template.
The target form template is used for form template configuration in the form template configuration method.
Illustratively, the target form template includes: document upper boundary marker, document lower boundary marker, page inner upper boundary marker, page inner lower boundary marker, column offset, and row reference column.
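A target form template holding those fields might be serialized as a plain mapping. Every key and sample value below is a hypothetical illustration; the patent does not prescribe a storage format, keyword values, or field names.

```python
# Hypothetical shape of a target form template (all names and values assumed).
template = {
    "doc_upper_boundary": {"keyword": "Item list", "offset": 0},
    "doc_lower_boundary": {"keyword": "Total", "offset": 0},
    "page_upper_boundary": "Header A",   # in-page upper boundary marker
    "page_lower_boundary": "Page",       # in-page lower boundary marker
    "column_offsets": [0, 120, 240, 360],
    "row_reference_column": 0,           # index of the row reference column
    "alignment": "top",                  # content storage manner of that column
}
```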
The above step 430 may include: displaying the text object set on a configuration operation interface; in a document boundary definition area of the configuration operation interface, generating an upper document boundary mark and a lower document boundary mark based on the selection operation of the text object set; generating an upper boundary mark in the page and a lower boundary mark in the page based on the selection operation of the text object set in an inner page boundary definition area of a configuration operation interface; in a column definition area of the configuration operation interface, determining the column offset based on a column segmentation line selected in the text object set; and in a row definition area of the configuration operation interface, determining the row reference column based on the selection operation of the text object set.
The configuration operation interface comprises a document boundary definition area, an in-page boundary definition area, a column definition area and a row definition area.
Illustratively, the document boundary defining zone, the in-page boundary defining zone, the column defining zone, and the row defining zone may be presented in the form of separate windows. After the designated key is triggered in the configuration operation interface, independent windows of a document boundary definition area, an in-page boundary definition area, a column definition area and a row definition area can be started.
For example, the document boundary definition area is presented as a document boundary definition window, which may include an input box for receiving document-boundary-related information input by the user. Alternatively, the input box may track the position of the mouse cursor and the currently selected content in real time, and after an enter keystroke is received, the content selected by the mouse cursor is used as the document boundary.

For example, the in-page boundary definition area is presented as an in-page boundary definition window, which may include an input box for receiving in-page-boundary-related information input by the user. Alternatively, the input box may track the position of the mouse cursor and the currently selected content in real time, and after an enter keystroke is received, the content selected by the mouse cursor is used as the in-page boundary.

For example, the column definition area is presented as a column definition window, which may include an input box for receiving column-related information input by the user. Alternatively, the input box may track the position of the mouse cursor and the currently selected content in real time, and after an enter keystroke is received, the content selected by the mouse cursor is used as a column.

For example, the row definition area is presented as a row definition window, which may include an input box for receiving row-related information input by the user. Alternatively, the input box may track the position of the mouse cursor and the currently selected content in real time, and after an enter keystroke is received, the content selected by the mouse cursor is used as a row.
Illustratively, the document boundary definition area, the in-page boundary definition area, the column definition area, and the row definition area may also be displayed in different areas of the configuration operation interface. The document boundary defining area can be used for configuring document boundaries, the page inner boundary defining area can be used for configuring page inner boundaries, the column defining area can be used for configuring column offsets, and the row defining area can be used for configuring row reference columns.
Illustratively, each of the document boundary defining area, the in-page boundary defining area, the column defining area, and the row defining area may include a receiving box therein, which may be used to receive user input of a document upper boundary flag, a document lower boundary flag, an in-page upper boundary flag, an in-page lower boundary flag, a column offset, and a row reference column.
Alternatively, the receiving boxes of the document boundary definition area, in-page boundary definition area, column definition area, and row definition area may be filled automatically. When the user performs a selection operation on the text object set, the document upper boundary marker, document lower boundary marker, in-page upper boundary marker, in-page lower boundary marker, column offset, and row reference column are generated automatically and filled into the corresponding receiving boxes.
Optionally, the target form template further comprises: nested configuration offset and nested key. Configuring the operation interface may further include: the nested definition region.
The step 430 may further include: in the nesting definition area, determining a nesting reference position and the nesting structure offset based on the selection operation of the text object set; in the nesting definition area, the nesting key is determined based on the selection operation of the nesting structure in the text object set.
Optionally, the nesting definition area may also include a receiving box that may receive a nesting structure offset and a nesting key input by a user.
Optionally, the receiving box of the nesting definition area may also be filled automatically. When the user performs a selection operation on the text object set, the nested structure offset and the nested keys can be generated automatically and filled into the receiving box.
A target form template is determined through selection operations on the boundary markers, so that information can be extracted from different types of forms. Furthermore, a dedicated operation area can be provided for each type of boundary marker, making form template configuration more comprehensive and enabling more accurate template-based form information extraction.
Based on the same application concept, a form template configuration apparatus corresponding to the form template configuration method is further provided in the embodiments of the present application, and since the principle of the apparatus in the embodiments of the present application for solving the problem is similar to that in the embodiments of the form template configuration method, the apparatus in the embodiments of the present application may be implemented as described in the embodiments of the method, and repeated details are not described again.
Please refer to fig. 5, which is a functional module diagram of a form template configuration apparatus according to an embodiment of the present application. Each module in the form template configuration apparatus in this embodiment is configured to perform each step in the above method embodiments. The form template configuration device includes: a second obtaining module 510, a text recognition module 520, and an operation receiving module 530; the contents of each module are as follows:
a second obtaining module 510, configured to obtain a table sample document, where the table sample document includes a table in a target format;
a text recognition module 520, configured to perform text recognition on the table sample document to obtain a text object set;
an operation receiving module 530, configured to receive a selection operation on a boundary marker of the text object set to determine a target form template, where the target form template is used for form information extraction in the form information extraction method according to the foregoing embodiment.
In addition, an embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program performs the steps of the form information extraction method or the form template configuration method described in the above method embodiment.
The computer program product of the table information extraction method and the table template configuration method provided in the embodiments of the present application includes a computer-readable storage medium storing a program code, where instructions included in the program code may be used to execute the steps of the table information extraction method or the table template configuration method described in the above method embodiments, which may be specifically referred to in the above method embodiments and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, or the portions thereof that substantially contribute to the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other various media capable of storing program codes. It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined or explained in subsequent figures.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (18)

1. A form information extraction method is characterized by comprising the following steps:
acquiring a target form template corresponding to a document to be processed, wherein the document to be processed comprises a form in a target format;
determining the boundary marks of the forms in the target format according to the target form template;
and determining the target table content in the document to be processed according to the boundary mark.
2. The method of claim 1, wherein the boundary marker comprises: document boundary markers and in-page boundary markers;
determining the target table content in the document to be processed according to the boundary mark, wherein the determining comprises the following steps:
determining a form coverage area in the document to be processed according to the document boundary mark;
extracting a table area in each page in the table coverage area according to the intra-page boundary mark;
and determining the content of the target table according to the table area of each page.
3. The method of claim 2, wherein the document boundary marking comprises: the document upper boundary mark and the document lower boundary mark, or the document boundary mark comprises: marking a boundary on the document;
determining a form coverage area in the document to be processed according to the document boundary mark, wherein the determining comprises the following steps: determining a form start bit in the document to be processed according to the document upper boundary mark; determining a form termination position in the document to be processed according to the document lower boundary mark, wherein a form coverage area is formed between the form start position and the form termination position; alternatively, the first and second electrodes may be,
determining a form coverage area in the document to be processed according to the document boundary mark, wherein the determining comprises the following steps:
determining a form start position in the document to be processed according to the document upper boundary marker;
and determining the end position of the document to be processed as a form end position, wherein a form coverage area is formed between the form start position and the form end position.
4. The method according to claim 2, wherein the determining a table coverage area in the document to be processed according to the document boundary markers comprises:
performing text recognition on the document to be processed to determine a text object set contained in the document to be processed;
and screening the text object set of the document to be processed according to the document boundary marks so as to determine a table coverage area in the document to be processed.
5. The method of claim 4, wherein the document boundary marker comprises: any one group of document boundary keywords, document boundary regular expressions, document boundary keywords and relative offsets, and document boundary regular expressions and relative offsets;
the screening is performed on the text object set of the document to be processed according to the document boundary mark to determine a form coverage area in the document to be processed, and the screening includes:
screening out the document boundary keywords in the text object set of the document to be processed, determining the positions of the document boundary keywords as the document boundaries of the document to be processed, and taking the text objects between the document boundaries as the form coverage area of the document to be processed; or,

screening out a first position of the document boundary keyword in the text object set of the document to be processed, and, taking the document boundary keyword as a reference, determining a second position at the relative offset from the document boundary keyword in the text object set, so as to determine the document boundary of the document to be processed, and taking the text objects between the document boundaries as the table coverage area of the document to be processed; or,

screening out the document boundary regular expression from the text object set of the document to be processed, determining the document boundary of the document to be processed according to the position matched by the document boundary regular expression, and taking the text objects between the document boundaries as the table coverage area of the document to be processed; or,
and screening a third position of the document boundary regular expression in the text object set of the document to be processed, determining a fourth position of the relative offset of the document boundary regular expression in the text object set by taking the document boundary regular expression as a reference so as to determine the document boundary of the document to be processed, and taking the text object between the document boundaries as a table coverage area of the document to be processed.
6. The method of claim 4, wherein the form overlay is located in M pages of content of the text to be processed, and wherein the form overlay comprises M subsets of text objects;
the in-page boundary markers include: the method comprises the steps of representing key words of boundary in a page or representing regular expressions of the boundary in the page;
the extracting a table area in each page in the table coverage area according to the intra-page boundary mark comprises:
aiming at the mth text object subset, comparing the keywords representing the in-page boundaries with the text character strings in the mth text object subset to screen out the in-page boundaries in the mth text object subset, wherein the text objects between the in-page boundaries are used as the table area of the mth page; or,
and aiming at the mth text object subset, matching the regular expression of the intra-page boundaries with the text character strings in the mth text object subset so as to screen out the intra-page boundaries in the mth text object subset, wherein the text objects between the intra-page boundaries are used as a table area of the mth page, and M is a positive integer which is greater than or equal to 1 and less than or equal to M.
7. The method of claim 2, wherein the text object comprises a text string and bounding box coordinates of the text string;
the determining the target table content according to the table area of each page comprises:
determining bounding box coordinates of the table areas of each page according to the bounding box coordinates of the text objects contained in the table areas of each page;
and sequentially splicing the target table content by the bounding box coordinates of the table areas of each page.
8. The method according to claim 7, wherein the bounding box coordinates of the table regions of the pages are spliced into target table contents in sequence, and the method comprises:
for the ith page table area and the (i+1)th page table area, scaling the bounding box coordinates of the (i+1)th page table area so that the width of the scaled (i+1)th page table area is the same as the width of the ith page table area;

splicing the ith page table area with the scaled (i+1)th page table area;
and the value of i is sequentially from 1 to M-1, so that M page table regions are sequentially spliced into the target table content.
9. The method of claim 2, wherein said in-page boundary marking comprises: an in-page upper boundary marker and an in-page lower boundary marker;
the extracting a table area in each page in the table coverage area according to the intra-page boundary mark comprises:
for each page in the table coverage area, determining an in-page table upper boundary according to the in-page upper boundary marker;

and determining an in-page table lower boundary according to the in-page lower boundary marker, wherein a table area is formed between the in-page table upper boundary and the in-page table lower boundary.
10. The method of claim 2, wherein the boundary marker comprises: column offset and row reference columns;
the determining the target table content according to the table area of each page comprises:
obtaining initial table data according to the table area of each page;
dividing the initial table data into a plurality of columns of table column data according to the column offsets;
determining a plurality of rows of table line data in the initial table data according to the row reference column, wherein the row reference column is one column in the table in the target format;
and determining the content of the target table according to the multi-column table column data and the multi-row table row data.
11. The method of claim 10, wherein determining a plurality of rows of tabular line data in the initial tabular data from the row reference column comprises:
if the content storage manner of the row reference column is top alignment, determining the content between the line where the nth object of the row reference column is located in the initial table data and the line preceding the line where the (n+1)th object is located as the nth row of table row data, wherein n is greater than or equal to 1 and less than or equal to N-1, and N is the number of objects contained in the row reference column; or,
determining a plurality of rows of table row data in the initial table data according to the row reference column, including:
if the content storage manner of the row reference column is bottom alignment, determining the content between the upper boundary of the initial table data and the line where the first object is located as the first row of table row data; determining the content between the line following the line where the nth object of the row reference column is located in the initial table data and the line where the (n+1)th object is located as the (n+1)th row of table row data, wherein n is greater than or equal to 1 and less than or equal to N-1, and N is the number of objects contained in the row reference column; or,
determining a plurality of rows of table row data in the initial table data according to the row reference column, including:
if the content storage mode of the row reference column is centered alignment, determining a first offset between an upper boundary of the initial table data and a first object of the row reference column;
determining the lower boundary of the first row of table line data which is offset from the first item object by the first offset amount by taking the first item object as a reference;
determining an upper boundary of the initial table data and a lower boundary of the first row of table row data as first row of table row data;
determining the n +1 th offset between the lower boundary of the nth row table line data and the (n + 1) th item object of the row reference column;
determining the lower boundary of the (n + 1) th row table line data which is offset from the (n + 1) th item object to be the (n + 1) th offset by taking the (n + 1) th item object as a reference;
and determining the region between the lower boundary of the nth row of table row data and the lower boundary of the (n+1)th row of table row data as the (n+1)th row of table row data, wherein n is greater than or equal to 1 and less than or equal to N-1, and N is the number of objects contained in the row reference column.
12. The method of claim 10, wherein determining the target table content from the plurality of columns of table column data and the plurality of rows of table row data comprises:
determining an output table according to the number of the columns of table row data and the number of the rows of table row data;
and filling the output table according to the character contents in the multi-column table column data and the multi-row table row data to obtain the target table content.
13. The method of claim 10, wherein determining the target table content according to the plurality of columns of table column data and the plurality of rows of table row data comprises:
determining whether a nested structure exists in each row of table row data;
if a nested structure exists in the table row data, determining a nested structure offset and nested keys according to the target table template;
if a nested structure exists in the table row data, for the jth row of table row data, determining a nested structure area in the jth row of table row data according to the nested structure offset;
in the nested structure area, determining the nested keys contained in the jth row of table row data and the key values corresponding to the nested keys;
and determining the target table content according to the nested keys and the key values corresponding to the nested keys, the plurality of columns of table column data and the plurality of rows of table row data.
14. The method of claim 13, wherein determining the target table content according to the nested keys and the key values corresponding to the nested keys, the plurality of columns of table column data and the plurality of rows of table row data comprises:
constructing a new header according to the nested keys and the original header of the target format;
and filling the new header with the key values corresponding to the nested keys and the table column data to obtain the target table content.
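Claims 13 and 14 together describe flattening a nested key-value region into extra columns appended to the original header. A minimal sketch of that header-rebuilding step, assuming the nested area of each row has already been parsed into a key-value dict (all identifiers here are illustrative, not from the patent):

```python
def build_target_content(original_header, rows, nested_per_row):
    """Extend the original header with the union of nested keys, then fill
    each output row with its plain column data plus matching key values.

    original_header: list of original column names (the target format's header).
    rows: list of lists of plain column values, one list per table row.
    nested_per_row: list of {nested_key: key_value} dicts, one per row
        (empty dict for rows without a nested structure).
    """
    nested_keys = []
    for d in nested_per_row:
        for k in d:
            if k not in nested_keys:  # keep first-seen order, no duplicates
                nested_keys.append(k)
    new_header = list(original_header) + nested_keys
    out = [new_header]
    for row, nested in zip(rows, nested_per_row):
        # Rows lacking a given nested key get an empty cell for that column.
        out.append(list(row) + [nested.get(k, "") for k in nested_keys])
    return out
```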
15. A form template configuration method, comprising:
obtaining a table sample document, wherein the table sample document comprises a table in a target format;
performing text recognition on the table sample document to obtain a text object set;
and receiving a boundary marker selection operation on the text object set to determine a target form template.
16. The method of claim 15, wherein the target form template comprises: a document upper boundary marker, a document lower boundary marker, an in-page upper boundary marker, an in-page lower boundary marker, a column offset, and a row reference column;
the receiving a boundary marker selection operation on the text object set to determine a target form template comprises:
displaying the text object set on a configuration operation interface, wherein the configuration operation interface comprises a document boundary definition area, an in-page boundary definition area, a column definition area and a row definition area;
in the document boundary definition area, generating the document upper boundary marker and the document lower boundary marker based on a selection operation on the text object set;
in the in-page boundary definition area, generating the in-page upper boundary marker and the in-page lower boundary marker based on a selection operation on the text object set;
in the column definition area, determining the column offset based on a selected column dividing line in the text object set;
and in the row definition area, determining the row reference column based on a selection operation on the text object set.
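The configuration flow in claim 16 amounts to collecting a handful of values from user selections in the four definition areas. One possible data shape for the resulting template; the field names and types are assumptions for illustration, not the patent's:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FormTemplate:
    # Markers delimiting the table across the whole document.
    doc_upper_marker: str
    doc_lower_marker: str
    # Markers delimiting the table within each page.
    page_upper_marker: str
    page_lower_marker: str
    # x-offsets of the selected column dividing lines.
    column_offsets: List[float] = field(default_factory=list)
    # Index of the column used as the row reference column.
    row_reference_column: int = 0
```

Extraction (claims 1-14) would then read these fields back: the markers locate the table region, the column offsets split it into columns, and the row reference column drives the row segmentation.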
17. The method of claim 16, wherein the target form template further comprises: a nested structure offset and nested keys;
the configuration operation interface further comprises a nesting definition area;
the receiving a boundary marker selection operation on the text object set to determine a target form template further comprises:
in the nesting definition area, determining a nesting reference position and the nested structure offset based on a selection operation on the text object set;
and in the nesting definition area, determining the nested keys based on a selection operation on a nested structure in the text object set.
18. An electronic device, comprising: a processor and a memory storing machine-readable instructions executable by the processor, wherein the machine-readable instructions, when executed by the processor while the electronic device runs, perform the steps of the method of any one of claims 1 to 17.
CN202211435725.4A 2022-11-16 2022-11-16 Form information extraction method, form template configuration method and electronic equipment Pending CN115759020A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211435725.4A CN115759020A (en) 2022-11-16 2022-11-16 Form information extraction method, form template configuration method and electronic equipment

Publications (1)

Publication Number Publication Date
CN115759020A true CN115759020A (en) 2023-03-07

Family

ID=85372064

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination