USRE50675E1 - Extracting information from tables embedded within documents - Google Patents
Extracting information from tables embedded within documents
Info
- Publication number
- USRE50675E1 (application US17/859,132)
- Authority
- US
- United States
- Prior art keywords
- cells
- cell
- header
- row
- column
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active, expires
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/80—Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
- G06F16/84—Mapping; Conversion
- G06F16/86—Mapping to a database
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/14—Tree-structured documents
- G06F40/143—Markup, e.g. Standard Generalized Markup Language [SGML] or Document Type Definition [DTD]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/151—Transformation
- G06F40/154—Tree transformation for tree-structured or markup documents, e.g. XSLT, XSL-FO or stylesheets
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
- G06F40/177—Editing, e.g. inserting or deleting of tables; using ruled lines
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
- G06F40/177—Editing, e.g. inserting or deleting of tables; using ruled lines
- G06F40/18—Editing, e.g. inserting or deleting of tables; using ruled lines of spreadsheets
Definitions
- Key information can be contained within tables that are themselves embedded in documents, whether full-text journal articles, patents, slides or health records. For example, important experimental results may be contained within a table in a PowerPoint presentation, or key lab values relevant to a patient may be contained within a table in an electronic health record. Information contained within tables is hard to extract automatically with high accuracy due to the wide variety and low quality of typical tables found in electronic documents.
- table structures are typically represented in semi-structured formats like SGML, HTML, document or presentation formats such as Word or PowerPoint or various XML formats (e.g., XHTML, XML OASIS or CALS table models).
- OCR: optical character recognition
- FIG. 1A is a flow diagram illustrating a process used in some implementations for extracting table information from semi-structured text.
- FIG. 1B is a flow diagram illustrating a process used in some implementations for extracting table information from unstructured text.
- FIG. 1C is a flow diagram illustrating a process used in some implementations for extracting table information from unstructured text.
- FIG. 1D is a flow diagram illustrating a process used in some implementations for extracting table information using OCR to create semi-structured text.
- FIGS. 2A and 2B show an example illustrating merging the cells of a table.
- FIG. 3 shows an example illustrating an annotated table in which a cell spans multiple rows.
- FIG. 4 is a flow diagram illustrating a process used in some implementations for associating cells in the same column via a shared index term column identifier.
- FIGS. 5A-5D show an example of processing a table within a patent document.
- FIG. 6 shows an example of information extracted from a processed table in FIG. 5C.
- FIG. 7 shows highlighting within an example table rendered in HTML showing the evidence for the extraction provided in FIG. 6.
- FIG. 8A presents an example of a table represented in plain text.
- FIG. 8B shows an example of the table of FIG. 8A converted to HTML format.
- reference in this specification to “one embodiment” or “an embodiment” means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosure.
- the various appearances of the phrase “in one embodiment” in the specification do not necessarily refer to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
- various features are described that may be exhibited by some embodiments and not by others.
- various requirements are described that may be requirements for some embodiments but not others.
- the information extraction technology disclosed herein can provide a method of extracting information from heterogeneous tables in either semi-structured or unstructured text by recognizing headers and merged cells.
- the information extraction technology can also create a richer representation of table structure to provide a linking of cells to their respective row and column headers.
- Unstructured text might contain a statement such as “profit in 2015 for Company A was 2 million dollars.”
- a table might include a legend of “profits (million dollars),” a column header of “2015,” a row header of “Company A” and a cell value of “2.”
- the information extraction technology can overcome these challenges by recognizing tables, header cells, and cells that are merged or should be merged, creating a richer representation of table structures and providing a convenient way of linking cells to their respective row and column headers. Use of this richer representation allows extraction patterns to successfully pull out information from a wide variety of differently formatted tables.
- FIG. 1A is a flow diagram illustrating a process 100 used in some implementations for extracting table information from semi-structured text.
- Examples of semi-structured text include documents in formats such as HTML or XML. These formats may provide a structure for tables based on tables containing one or more rows, which themselves contain one or more cells, but they may not provide relationships between cells and their respective row and column headers or other defining cells. Moreover, although these formats may allow for differentiation between row- and column-defining or header cells and data cells, in practice, many tables found in semi-structured documents fail to correctly apply these identifiers.
- process 100 can find one or more tables in a semi-structured portion of input document 101.
- Stage 106 can involve looking for a structured element such as “table,” although other element names are also possible.
- process 100 can identify cell contexts. Each cell can be classified as either a header cell or a data cell. Header cells can be recognized based on one or more of the following: explicit coding of header cells in the input; formatting differences between header cells and other cells; the presence of at least one header cell for every column; the presence of horizontal lines; the nature of the cell content (blank vs. numeric vs. textual; lowercase vs. uppercase text); the presence of measurement units within brackets; words referring to operations on the values in the table (e.g., “sum,” “total,” “average,” “avg.”). Header cells can be further classified as being a column header cell, a row header cell or both, based on their position in the table. Any cell not recognized as a header cell may be considered to be a data cell.
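The header cues listed above can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented method: the names `header_cues` and `classify`, the regular expressions, and the rule that a row without numeric cells is a row of column headers are all hypothetical choices made for this example. Only a few of the listed cues (explicit coding, content type, bracketed units, aggregate words, position) are covered.

```python
import re

# Cue values are assumptions for illustration; the source lists the kinds of
# cue but does not prescribe concrete patterns.
AGGREGATE_WORDS = {"sum", "total", "average", "avg."}
BRACKETED_UNIT = re.compile(r"[(\[][^)\]]+[)\]]")
NUMERIC = re.compile(r"^[+-]?\d+(?:[.,]\d+)*%?$")


def is_numeric(text):
    """True when the cell content is a plain number (optionally a percentage)."""
    return bool(NUMERIC.match(text.strip()))


def header_cues(text, explicitly_marked=False):
    """Return the names of the header cues that fire for one cell."""
    cues = []
    stripped = text.strip()
    if explicitly_marked:                      # explicit coding in the input
        cues.append("explicit")
    if stripped.lower() in AGGREGATE_WORDS:    # words such as "total"
        cues.append("aggregate_word")
    if BRACKETED_UNIT.search(stripped):        # e.g. "(million dollars)"
        cues.append("bracketed_unit")
    return cues


def classify(table):
    """Label every cell as 'column_header', 'row_header' or 'data'."""
    labels = []
    for row in table:
        numeric_row = any(is_numeric(text) for text in row)
        row_labels = []
        for col, text in enumerate(row):
            textual = bool(text.strip()) and not is_numeric(text)
            if textual and not numeric_row:
                # Assumption: a row with no numbers is a row of column headers.
                row_labels.append("column_header")
            elif textual and (col == 0 or header_cues(text)):
                row_labels.append("row_header")
            else:
                row_labels.append("data")
        labels.append(row_labels)
    return labels
```

A real implementation would likely weigh formatting differences and ruled lines as well; this sketch uses content and position only, so a header row made of years such as "2015" would be misread as data.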
- Each data cell can be linked to its respective column and row headers.
- a data cell can be linked to multiple column or row headers, for example, when the table has individual column headers for each column, and then other column headers spanning several columns. Header cells can also be linked to other header cells when there are multiple levels of headers.
- column and row headers can be encoded directly by annotating each data cell with the text of the column header cell(s) in its column and the text of the row header cell(s) in its row.
- the relation between a data cell and its respective row header(s) and column header(s) can be encoded indirectly.
- Each cell can be annotated with one or more identifiers for the rows that it spans and one or more identifiers for the columns that it spans.
- In the annotated representation of tables 112 (FIGS. 1A-1D), the annotations can use any identifier, for instance, numeric identifiers (indexes) that reflect the position of that row or column within the table.
- FIG. 3 provides an example of an annotated representation of a table.
- FIG. 5D shows another possible implementation where these annotations are made available for inspection in tooltips (or other layers or portions of display) of tables and table cells. These tooltips pop up when a user hovers their mouse cursor over a table cell.
- FIG. 5D shows a tooltip 532 for one of the table cells, displaying the annotated identifiers for the row and column that the cell occupies, and also another tooltip 534 for the entire table, showing a unique identifier for the table.
- Cells that span multiple rows or columns can be recognized according to the input format (e.g., XML CALS format, HTML, XHTML). In some implementations, cells that span multiple rows or columns may be expanded out. In some implementations, cells that span multiple rows or columns may have multiple headers, and headers that span multiple rows or columns are shared by multiple cells. In some implementations, cells including header cells that span multiple rows or columns may have multiple indexes corresponding to the individual rows and columns that they span.
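One way to give spanning cells multiple indexes can be sketched as follows. The sketch assumes the HTML-style model in which each row lists only the cells that start in it and carries `rowspan`/`colspan` counts; the `Cell` class and `annotate` function are hypothetical names, and the 1-based numbering is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    text: str
    rowspan: int = 1
    colspan: int = 1
    rows: list = field(default_factory=list)  # row indexes the cell spans
    cols: list = field(default_factory=list)  # column indexes the cell spans


def annotate(rows):
    """Assign 1-based row and column indexes to every cell, honouring spans."""
    occupied = set()  # (row, col) grid positions already claimed
    for r, row in enumerate(rows, start=1):
        c = 1
        for cell in row:
            # Skip positions filled by a cell spanning down from a row above.
            while (r, c) in occupied:
                c += 1
            cell.rows = list(range(r, r + cell.rowspan))
            cell.cols = list(range(c, c + cell.colspan))
            for rr in cell.rows:
                for cc in cell.cols:
                    occupied.add((rr, cc))
            c += cell.colspan
    return rows
```

With a first row of three cells whose first cell has `rowspan=2`, the two cells of the second row receive column indexes 2 and 3 rather than 1 and 2, which is the adjustment that simple counting misses.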
- process 100 can optionally merge two or more cells of a table.
- table structures may be corrected by merging cells and/or rows.
- FIG. 2A shows an example table 200 in which the second row of column header text has been split into separate rows, 201 and 202.
- One of the chemical descriptions has also been split across multiple cells, 203 and 204.
- the first column of row headers contains empty cells, 205 and 206, which reflects that the cell above should span the row with the empty cell as well.
- FIG. 2B shows a table 250, which is the result of merging cells from table 200.
- Process 100 can merge cells based on one or more of the following: empty cells; rows with similar structure (e.g., where the same number of cells span the same columns); the distinguishing of header cells and data cells; the amount of text within a cell suggesting wrapping into the next cell; mismatched brackets in the text that would match if cells are merged; cell contents starting or ending with a conjunction or a preposition; or any combination thereof.
- stage 110 in FIG. 1A can merge every cell in row 210 with the cell above it (row 209) due to these rows being rows of column headers and there being no empty cell in the bottom row (210), but many empty cells in the top row (209).
- Row 210 is merged into row 209 because column headers are typically vertically aligned to the bottom.
- stage 110 can merge every cell in row 212 with the cell above it due to these rows not being rows of column headers and there being multiple empty cells in row 212 .
- Row 212 can be merged into row 211 because the text in data cells tends to be vertically aligned to the top and the text in row 212 (“hexahydrate”) being lowercase suggests that it is not the start of a cell.
- stage 110 can prefer to merge them horizontally rather than vertically because this is the first row of column headers. Tables often contain main column headers spanning multiple columns, with sub-headers below.
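The data-row case described above (a row with empty cells whose text starts in lowercase is folded into the row above) can be sketched as follows. This is a simplified illustration, not the full set of merge cues: the function names are hypothetical, only vertical merging of data rows is handled, and the chemical name in the usage example is invented around the word "hexahydrate" from the description.

```python
def is_continuation(row):
    """Heuristic: some cells are empty and every filled cell starts lowercase."""
    filled = [text.strip() for text in row if text.strip()]
    has_empty = len(filled) < len(row)
    return bool(filled) and has_empty and all(t[0].islower() for t in filled)


def merge_continuations(rows):
    """Fold continuation rows into the row above, cell by cell."""
    merged = []
    for row in rows:
        if merged and len(row) == len(merged[-1]) and is_continuation(row):
            above = merged[-1]
            merged[-1] = [
                " ".join(part for part in (a.strip(), b.strip()) if part)
                for a, b in zip(above, row)
            ]
        else:
            merged.append(list(row))
    return merged


rows = [
    ["Compound", "Mass"],
    ["nickel chloride", "1.2"],   # hypothetical example values
    ["hexahydrate", ""],
    ["water", "5"],
]
print(merge_continuations(rows))
```

Rows of column headers would need the opposite treatment (merging upward into bottom-aligned text), which this sketch does not attempt.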
- process 100 can optionally index the document into a format optimized for large-scale querying, search and text mining. For example, an index can be created that allows fast searching of tabular information contained within millions of individual documents.
- Process 100 can manipulate the representation, which includes the annotations for the table cells, and these can be converted to different formats and optimized for different needs, where the annotations for the table cells are nevertheless preserved.
- One example of this is converting the representation into a format optimized for efficient search.
- the annotation process is automatic and results in annotations represented in a digital format that is amenable to further automatic manipulation, namely of the kind needed to facilitate computer-based search.
- the identifiers for the rows and columns of a table enable a search engine to find cells that occupy the same row or column by comparing these identifiers.
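As a rough illustration of how row and column identifiers could support search, the sketch below builds an in-memory lookup keyed by table, axis and index. The dictionary layout, the key shape and the name `build_index` are assumptions; the source does not describe the index format, and a production search index would look quite different.

```python
from collections import defaultdict


def build_index(tables):
    """Map (table_id, axis, index) to the texts of the cells that carry it."""
    index = defaultdict(list)
    for table_id, cells in tables.items():
        for cell in cells:
            for r in cell["rows"]:
                index[(table_id, "row", r)].append(cell["text"])
            for c in cell["cols"]:
                index[(table_id, "col", c)].append(cell["text"])
    return index


tables = {
    "t1": [
        {"text": "2015", "rows": [1], "cols": [2]},
        {"text": "Company A", "rows": [2], "cols": [1]},
        {"text": "2", "rows": [2], "cols": [2]},
    ]
}
index = build_index(tables)
print(index[("t1", "col", 2)])
```

Cells that share a key occupy the same row or column, so finding neighbours becomes a lookup rather than a walk over the table markup.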
- process 100 can extract information from the table.
- from an HTML table, rows can be extracted, but they will not always be correct.
- FIG. 3 shows an example annotated table, 300, in which a cell, 301, spans multiple rows.
- a table corresponding to table 300 can typically be represented by including the cell inside the first row (e.g., corresponding to row 302) that it spans, and annotating the cell as spanning two rows.
- the first row (e.g., corresponding to row 302) of the corresponding HTML table has three elements (e.g., corresponding to cells 301, 304 and 305), and the second row (e.g., corresponding to row 303) has only two elements (e.g., corresponding to cells 306 and 307).
- a cell (e.g., corresponding to cell 301) that spans into the second row does not even appear within that row.
- finding the appropriate column headers for elements in the second row in the HTML representation can be particularly challenging, because counting alone will not be enough: you may need to adjust for any spanning issues.
- table 300 of FIG. 3 includes the column and row identifier annotations using the index term approach discussed above in relation to stage 108.
- each cell can contain an annotation with respect to both the row and column to which the cell belongs.
- the content of a cell can be linked to its headers by finding matching index values. For example, assuming that cell 301 is a header cell, cell 307 would adopt cell 301 as a row header because they both have a row index of 2. Cells in the same row share the same row identifiers, and because cell 301 spans two rows, it is a member of rows 1 and 2.
- cells in the same column can have the same column identifiers, so any cell can be associated with other cells in that column (including the header cells for the column). For example, cell 307 shares the same column as cell 305 via having a column index of 3.
- FIG. 4 shows a process 400 used in some implementations in which associating cells in the same column via a shared index term column identifier can be achieved using a join operator, where the column indexes of pairs of cells are joined. This leaves only pairs of cells with the same column index (i.e., cells in the same column).
- Process 100 can restrict one of the cells in each pair to be a column header cell and the other to be a data cell, for example, by looking at the annotations. This process can find all pairs of cells such that one is a data cell, the other is a column header cell and they belong to the same column. Searching for cells in the same row is a similar process, with row indexes used instead of column indexes, and restricting one of the cells to be a row header instead of a column header.
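The join described above can be sketched as a nested loop over annotated cells: pair every header of the requested kind with every data cell that shares at least one index on the chosen axis. The dictionary layout (`text`, `kind`, `rows`, `cols`) and the function name `join` are assumptions for illustration.

```python
def join(cells, axis, header_kind):
    """Pair header cells with data cells sharing an index on `axis`.

    axis is "cols" for column headers or "rows" for row headers.
    """
    pairs = []
    for header in cells:
        if header["kind"] != header_kind:
            continue
        for data in cells:
            if data["kind"] == "data" and set(header[axis]) & set(data[axis]):
                pairs.append((header["text"], data["text"]))
    return pairs


cells = [
    {"text": "2015", "kind": "column_header", "rows": [1], "cols": [2]},
    {"text": "Company A", "kind": "row_header", "rows": [2], "cols": [1]},
    {"text": "2", "kind": "data", "rows": [2], "cols": [2]},
]
print(join(cells, "cols", "column_header"))
print(join(cells, "rows", "row_header"))
```

Because spanning cells carry several indexes, the set intersection lets one header match data cells in every row or column it spans.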
- the extraction stage 116 can output the column header for the cell (readily available in the annotations) along with the cell contents (also readily available in the representation for the cell).
- constraints can be imposed on the content of the headers and the cell. This can be based on the type of the content, such as number, date, company, chemical description or disease. It can be based on a particular kind of disease such as “neoplasm” using an ontology, or a particular range (e.g., 1 to 100). It can also be based on pattern matching of the content using regular expressions or linguistic patterns.
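Content constraints of the range and pattern kind can be sketched as small predicate functions applied to (header, value) pairs. The helper names and the sample records are hypothetical; ontology-based constraints such as "neoplasm" are out of scope for this sketch.

```python
import re


def in_range(low, high):
    """Build a predicate that accepts numeric text within [low, high]."""
    def check(text):
        try:
            return low <= float(text) <= high
        except ValueError:
            return False
    return check


def matches(pattern):
    """Build a predicate that accepts text matching a regular expression."""
    compiled = re.compile(pattern)
    return lambda text: bool(compiled.search(text))


def extract(records, header_ok, value_ok):
    """Keep only (header, value) pairs that satisfy both constraints."""
    return [(h, v) for h, v in records if header_ok(h) and value_ok(v)]


records = [
    ("Profit (million dollars)", "2"),
    ("Profit (million dollars)", "n/a"),
    ("Staff", "3"),
    ("Profit (million dollars)", "250"),
]
print(extract(records, matches(r"(?i)^profit"), in_range(1, 100)))
```

Keeping each constraint as an independent predicate makes it straightforward to combine type, range and pattern checks per extraction pattern.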
- FIG. 1B is a flow diagram illustrating a process 120 used in some implementations for extracting table information from unstructured text, for example, where the tables are initially in plain text.
- Examples of such documents include some types of electronic health records.
- an additional initial stage 102 can be performed to convert the unstructured text to a semi-structured representation similar to the one used as input for process 100.
- Process 120 can then continue from stage 102 to stages 106-116, discussed above in relation to process 100.
- FIG. 8A presents a table 800 represented in plain text, where the vertical alignment of the text is the only indication of the table's structure.
- FIG. 8B shows the same table converted to HTML format 810. This is a possible output of stage 102 (FIG. 1B).
- Stage 102 can include identifying one or more tables in unstructured text (e.g., which lines of text contain tables) and establishing the table structure (e.g., determining row and column boundaries).
- Process 120 can identify the tables by performing one or more of: identifying lines within the text; identifying multiple rows where text or white space is aligned; identifying table captions or headers; or any combination thereof.
- process 120 can establish table structures by one or more of the following: establishing the column boundaries based on the alignment of white space across rows; recognizing the columns that a header spans based on the alignment of the header with respect to the columns below; establishing cell contents according to alignment of contents and white space; or any combination thereof.
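Establishing column boundaries from white space that is aligned across rows can be sketched as follows. The sketch assumes fixed-width text with spaces (no tabs) and treats a run of at least `min_gap` character positions that are blank on every line as a column separator; the function names and the `min_gap` default are assumptions.

```python
def column_spans(lines, min_gap=2):
    """Return (start, end) character spans of columns in aligned plain text."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    # A position is blank when it holds a space on every line.
    blank = [all(p[i] == " " for p in padded) for i in range(width)]
    spans, start, i = [], None, 0
    while i < width:
        if blank[i]:
            j = i
            while j < width and blank[j]:
                j += 1
            # Only a wide blank run (or trailing blanks) closes a column, so
            # single spaces inside "Company A" do not split the cell.
            if start is not None and (j - i >= min_gap or j == width):
                spans.append((start, i))
                start = None
            i = j
        else:
            if start is None:
                start = i
            i += 1
    if start is not None:
        spans.append((start, width))
    return spans


def split_rows(lines, min_gap=2):
    """Cut every line at the shared column spans and strip the padding."""
    spans = column_spans(lines, min_gap)
    width = max(len(line) for line in lines)
    return [[line.ljust(width)[a:b].strip() for a, b in spans] for line in lines]


rows = [("Name", "2015", "2016"), ("Company A", "2", "3"), ("Company B", "4", "5")]
lines = [f"{name:<9}  {a:>4}  {b:>4}" for name, a, b in rows]
print(split_rows(lines))
```

Headers that span several columns would break the all-lines-blank test, so a fuller version would probably compute the gaps from the data rows first and align headers afterwards.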
- FIG. 1C is a flow diagram illustrating a process 130 used in some implementations for extracting table information from unstructured text, as in process 120.
- Process 130 uses a single stage for identifying the tables and for establishing the table structure, and can then continue from stage 109 to stages 110-116, discussed above in relation to process 100.
- Stage 107 can include identifying one or more tables in unstructured text (e.g. which lines of text contain tables) and establishing some of the table structure (e.g. determining row and column boundaries).
- Process 130 can identify the tables by performing one or more of: identifying lines within the text; identifying multiple rows in which text or white space is aligned; identifying table captions or headers; or any combination thereof.
- Stage 109 can establish table structures by one or more of the following: establishing the number of columns based on the differences in the amount of white space between one column of text and the next; recognizing the columns that a header spans based on the alignment of the header vs. columns below; establishing cell contents according to alignment of contents and white space; or any combination thereof.
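Estimating the number of columns from differences in white space can be approximated by splitting each line on wide gaps and taking the most common cell count. This is a loose sketch: the two-space threshold and the function names are assumptions, and the approach fails when a cell is empty.

```python
import re
from collections import Counter


def split_on_wide_gaps(line, min_gap=2):
    """Split a line wherever at least `min_gap` consecutive spaces occur."""
    return re.split(r" {%d,}" % min_gap, line.strip())


def estimate_column_count(lines):
    """Return the most common number of cells per non-empty line."""
    counts = Counter(len(split_on_wide_gaps(line)) for line in lines if line.strip())
    return counts.most_common(1)[0][0]


lines = [
    "Name       2015  2016",
    "Company A     2     3",
    "Quarterly profits",
]
print(estimate_column_count(lines))
```

Lines that yield a different count, such as the caption-like last line here, are candidates for captions or spanning headers rather than ordinary rows.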
- Each cell can be classified as either a header cell or a data cell.
- Header cells are recognized based on one or more of the following: explicit coding of header cells in the input; formatting differences between header cells and other cells; the presence of at least one header cell for every column. Header cells can be further classified as being a column header cell, a row header cell or possibly both, based on their position in the table.
- Each cell can be linked to its respective column and row headers.
- a cell can be linked to multiple column or row headers, for example, when the table has individual headers for each column, and then other headers spanning multiple columns.
- column and row headers can be encoded directly by annotating each cell with the text of the column header cells in its column and the text of the row header cells in its row.
- the relation between a cell and its respective row headers and column headers can be encoded indirectly.
- Each cell can be annotated with one or more identifiers for the rows that it spans and one or more identifiers for the columns that it spans.
- The annotations (FIGS. 1A-1D) can use any unique identifier, for instance, numeric identifiers (indexes) that reflect the position of that row or column within the table. These embodiments allow identification of any two cells in the same row or column, even if neither is a header cell.
- at least some of the column and row headers can be encoded directly and at least some can be encoded indirectly.
- cells that span multiple rows or columns may be expanded out.
- cells that span multiple rows or columns may have multiple headers, and headers that span multiple rows or columns are shared by multiple cells.
- cells including header cells that span multiple rows or columns may have multiple indexes corresponding to the individual rows and columns they span.
- FIG. 1D is a flow diagram illustrating a process 140 used in some implementations for extracting table information using OCR to create semi-structured text. This could involve documents in PDF format (image or text).
- an additional initial stage 104 can be performed to run an OCR process on input document 101, thereby creating a semi-structured representation similar to the one used as input for stage 106 in FIG. 1A.
- the OCR process may create an unstructured document that can be provided to stage 102 in FIG. 1B or stage 107 in FIG. 1C.
- FIG. 5A shows an example of a table 500 in a patent PDF document.
- Table 510 in FIG. 5B shows a stylesheet rendering of an XML version of the same table.
- This XML might be derived from conversion of the PDF document, or, in some embodiments, the patent authority may provide an XML version of the patent document.
- FIG. 5C shows a stylesheet rendering of XML (e.g., table 520) after the XML version of table 500 is processed in accordance with some embodiments of the presently disclosed technology. As illustrated in FIG. 5C, the cells containing the text “ARC1172 (SEQ ID NO 222)” have been merged, and this entire piece of text is the row header for the values 17 and 3 in the data cells in the same row.
- FIG. 5D shows embodiments of the presently disclosed technology where annotations to tables or table cells are made available for inspection in tooltips (or other layers or portions of display). These tooltips pop up when a user hovers their mouse cursor over a table cell.
- FIG. 5D shows a tooltip 532 for one of the table cells, displaying the annotated identifiers for the row and column that the cell occupies, and also another tooltip 534 for the entire table, showing a unique identifier for the table.
- FIG. 6 shows an example 600 of data extracted from table 520 in FIG. 5C.
- the example extraction can be performed in response to searches, queries, or other informational requests for the half-life values (T1/2) of the aptamers.
- the data extracted can be shown in HTML but can also be extracted into other formats such as Excel, XML, JSON, TSV and CSV.
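Serializing extracted records to some of the formats named above can be sketched with the Python standard library. The record layout (`row_header`, `column_header`, `value`) is an assumption based on the profits example in the description, and the function names are hypothetical.

```python
import csv
import io
import json


def to_json(records):
    """Serialize the records as a JSON array of objects."""
    return json.dumps(records, indent=2)


def to_delimited(records, delimiter=","):
    """Serialize the records as CSV (default) or TSV (delimiter='\\t')."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(records[0].keys()),
        delimiter=delimiter,
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = [{"row_header": "Company A", "column_header": "2015", "value": "2"}]
print(to_json(records))
print(to_delimited(records))
print(to_delimited(records, delimiter="\t"))
```

Because each record already carries its headers, the same list can feed every output format without going back to the table markup.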
- FIG. 7 provides evidence for the extracted information in example 600.
- a user can be referred directly to the correct position of a table (e.g., table 700) with highlighting to show the different pieces of data that have been extracted.
- the computing devices on which the described technology may be implemented can include one or more central processing units, memory, input devices (e.g., keyboard and pointing devices), output devices (e.g., display devices), storage devices (e.g., disk drives) and network devices (e.g., network interfaces).
- the memory and storage devices are computer-readable storage media that can store instructions that implement at least portions of the described technology.
- the data structures and message structures can be stored or transmitted via a data transmission medium, such as a signal on a communications link.
- Various communications links can be used, such as the Internet, a local area network, a wide area network, or a point-to-point dial-up connection.
- computer-readable media can comprise computer-readable storage media (e.g., “non-transitory” media) and computer-readable transmission media.
- the word “or” refers to any possible permutation of a set of items.
- the phrase “A, B, or C” refers to at least one of A, B, C, or any combination thereof, such as any of the following: A; B; C; A and B; A and C; B and C; A, B, and C; or multiples of any item such as A and A; B, B, and C; A, A, B, C, and C; etc.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Document Processing Apparatus (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Much valuable information in documents is presented within tables. However, the information within tables is hard to extract automatically with high accuracy due to the wide variety and low quality of typical tables found in electronic documents. Information extraction technology can provide a method of extracting information from heterogeneous tables by recognizing tables, the header cells, and cells that are merged or should be merged, creating a richer representation of table structure and providing a convenient way of linking cells to their row and column headers. Use of this richer representation allows a few extraction patterns to successfully pull out information from a wide variety of differently formatted tables.
Description
This application is a reissue of U.S. Pat. No. 10,706,218, which issued on Jul. 7, 2020, entitled “EXTRACTING INFORMATION FROM TABLES EMBEDDED WITHIN DOCUMENTS,” the contents of which are incorporated herein by reference in their entirety.
This application claims priority to U.S. Patent Application No. 62/337,216, entitled “EXTRACTING INFORMATION FROM TABLES EMBEDDED WITHIN DOCUMENTS,” filed May 16, 2016, which is incorporated by reference in its entirety.
Key information can be contained within tables that are themselves embedded in documents, whether full-text journal articles, patents, slides or health records. For example, important experimental results may be contained within a table in a PowerPoint presentation, or key lab values relevant to a patient may be contained within a table in an electronic health record. Information contained within tables is hard to extract automatically with high accuracy due to the wide variety and low quality of typical tables found in electronic documents.
One particular difficulty in extracting information contained within tables arises from the way in which table structures are typically represented in semi-structured formats like SGML, HTML, document or presentation formats such as Word or PowerPoint or various XML formats (e.g., XHTML, XML OASIS or CALS table models). Cells can span multiple rows or columns, and even for simple cells there is no association between the cell and its respective column and row headers.
Another difficulty arises from the fact that many tables found in electronic formats contain representation errors. These can arise from a variety of factors, including imperfect optical character recognition (OCR) and the breaking apart of cells to improve the readability of items within a table.
The following description and drawings are illustrative and are not to be construed as limiting. Numerous specific details are described to provide a thorough understanding of the disclosure. However, in certain instances, well-known or conventional details are not described in order to avoid obscuring the description. References to one or an embodiment in the present disclosure can be, but not necessarily are, references to the same embodiment; and, such references mean at least one of the embodiments.
Reference in this specification to “one embodiment” or “an embodiment” means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosure. The various appearances of the phrase “in one embodiment” in the specification do not necessarily refer to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Moreover, various features are described that may be exhibited by some embodiments and not by others. Similarly, various requirements are described that may be requirements for some embodiments but not others.
The terms used in this specification generally have their ordinary meanings in the art, within the context of the disclosure, and in the specific context in which each term is used. Certain terms that are used to describe the disclosure are discussed below, or elsewhere in the specification, to provide additional guidance to the practitioner regarding the description of the disclosure. Certain terms may be highlighted, for example, by using italics and/or quotation marks. The use of highlighting has no influence on the scope and meaning of a term; the scope and meaning of a term is the same, in the same context, whether or not it is highlighted. It will be appreciated that the same thing can be said in more than one way.
Consequently, alternative language and synonyms may be used for any one or more of the terms discussed herein; no special significance is to be placed on whether or not a term is elaborated or discussed herein. Synonyms for certain terms are provided. A recital of one or more synonyms does not exclude the use of other synonyms. The use of examples anywhere in this specification, including examples of any terms discussed herein, is illustrative only, and is not intended to further limit the scope and meaning of the disclosure or of any exemplified term. Likewise, the disclosure is not limited to various embodiments given in this specification.
Without intent to further limit the scope of the disclosure, examples of instruments, apparatus, methods and their related results according to the embodiments of the present disclosure are given below. Note that titles or subtitles may be used in the examples for the convenience of a reader, which in no way should limit the scope of the disclosure. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains. In the case of conflict, the present document, including definitions, will control.
Various examples of the invention will now be described. The following description provides certain specific details for a thorough understanding and enabling description of these examples. One skilled in the relevant technology will understand, however, that the invention may be practiced without many of these details. Likewise, one skilled in the relevant technology will also understand that the invention may include many other obvious features not described in detail herein. Additionally, some well-known structures or functions may not be shown or described in detail below, to avoid unnecessarily obscuring the relevant descriptions of the various examples.
The terminology used below is to be interpreted in the broadest reasonable manner, even though it is being used in conjunction with a detailed description of certain specific examples of the invention. Indeed, certain terms may even be emphasized below; however, any terminology intended to be interpreted in any restricted manner will be overtly and specifically defined as such in this Detailed Description.
The information extraction technology disclosed herein can provide a method of extracting information from heterogeneous tables in either semi-structured or unstructured text by recognizing headers and merged cells. The information extraction technology can also create a richer representation of table structure to provide a linking of cells to their respective row and column headers.
Information extraction is concerned with extracting relationships from unstructured and semi-structured text. Unstructured text might contain a statement such as “profit in 2015 for Company A was 2 million dollars.” A table might include a legend of “profits (million dollars),” a column header of “2015,” a row header of “Company A” and a cell value of “2.” Even when the table is annotated with XML or HTML elements, there can be several challenges to extracting the information, such as the following:
- 1) Many formats represent table cells as elements contained in table rows, and the table rows as elements contained in tables. The relationship between a cell and the table column it belongs to is not directly represented.
- 2) Cells, including header cells, may span multiple columns and rows.
- 3) Logical cells in tables are often split to aid visibility or during OCR.
- 4) Distinctions between value cells and header cells are often missing in text-based content sources.
As described in greater detail below, the information extraction technology can overcome these challenges by recognizing tables, header cells, and cells that are merged or should be merged, creating a richer representation of table structures and providing a convenient way of linking cells to their respective row and column headers. Use of this richer representation allows extraction patterns to successfully pull out information from a wide variety of differently formatted tables.
At stage 106, process 100 can find one or more tables in a semi-structured portion of input document 101. Stage 106 can involve looking for a structured element such as “table,” although other element names are also possible.
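By way of illustration only, the table-finding step for an HTML input could be sketched as follows. This is a minimal sketch using Python's standard `html.parser` module; the class name `TableFinder` and the dictionary keys are hypothetical and do not appear in the disclosure, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableFinder(HTMLParser):
    """Collect each <table> as a list of rows; each row is a list of cell dicts."""

    def __init__(self):
        super().__init__()
        self.tables = []   # tables in document order
        self._cell = None  # cell currently being read, or None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self._cell = {
                "text": "",
                "explicit_header": tag == "th",  # explicit coding of a header cell
                "rowspan": int(attrs.get("rowspan") or 1),
                "colspan": int(attrs.get("colspan") or 1),
            }
            self.tables[-1][-1].append(self._cell)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._cell["text"] = self._cell["text"].strip()
            self._cell = None

    def handle_data(self, data):
        # Text outside a cell (e.g., ordinary paragraphs) is ignored.
        if self._cell is not None:
            self._cell["text"] += data
```

Other element names (for example, those of the XML CALS format) would need their own mapping; only the HTML element names are assumed here.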
At stage 108, process 100 can identify cell contexts. Each cell can be classified as either a header cell or a data cell. Header cells can be recognized based on one or more of the following: explicit coding of header cells in the input; formatting differences between header cells and other cells; the presence of at least one header cell for every column; the presence of horizontal lines; the nature of the cell content (blank vs. numeric vs. textual; lowercase vs. uppercase text); the presence of measurement units within brackets; words referring to operations on the values in the table (e.g., “sum,” “total,” “average,” “avg.”). Header cells can be further classified as being a column header cell, a row header cell or both, based on their position in the table. Any cell not recognized as a header cell may be considered to be a data cell.
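As a non-limiting example, a few of the header cues listed above could be combined as in the following sketch. The disclosure names the cues but not how they are weighed, so the scoring, the threshold of two cues, the regular expressions and the function name `looks_like_header` are all assumptions made for illustration.

```python
import re

# Hypothetical cue definitions; only the cue types come from the description.
OPERATION_WORDS = {"sum", "total", "average", "avg."}
UNIT_IN_BRACKETS = re.compile(r"[(\[][^()\[\]\d]+[)\]]")  # e.g. "(million dollars)"
NUMERIC = re.compile(r"^[\s$€£+-]*\d[\d,.\s%]*$")


def looks_like_header(text, explicit_header=False, bold=False):
    """Guess whether a cell is a header cell from a handful of cues."""
    if explicit_header:            # explicit coding in the input wins outright
        return True
    stripped = text.strip()
    if not stripped or NUMERIC.match(stripped):
        return False               # blank or purely numeric content: treat as data
    words = {w.lower() for w in stripped.split()}
    score = 0
    score += 1 if bold else 0                                # formatting difference
    score += 1 if UNIT_IN_BRACKETS.search(stripped) else 0   # units within brackets
    score += 1 if words & OPERATION_WORDS else 0             # "sum", "total", ...
    score += 1 if stripped[0].isupper() else 0               # uppercase vs. lowercase
    return score >= 2              # assumed threshold
```

A purely numeric column header such as a year would be missed by this sketch unless it is explicitly coded or caught by another cue, which is one reason the description lists several cues that can be used together.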
Each data cell can be linked to its respective column and row headers. A data cell can be linked to multiple column or row headers, for example, when the table has individual column headers for each column, and then other column headers spanning several columns. Header cells can also be linked to other header cells when there are multiple levels of headers. In some implementations, column and row headers can be encoded directly by annotating each data cell with the text of the column header cell(s) in its column and the text of the row header cell(s) in its row. In some implementations, the relation between a data cell and its respective row header(s) and column header(s) can be encoded indirectly. Each cell can be annotated with one or more identifiers for the rows that it spans and one or more identifiers for the columns that it spans.
These annotations, represented in FIGS. 1A-1D as Annotated representation of tables 112, are generated by process 108 in FIGS. 1A, 1B and 1D and process 109 in FIG. 1C. The annotations can use any identifier, for instance, numeric identifiers (indexes) that reflect the position of that row or column within the table. FIG. 3 provides an example of an annotated representation of a table. These embodiments allow identification of any two cells in the same row or column, even if neither is a header cell. In some implementations, at least some of the column and row headers can be encoded directly and at least some can be encoded indirectly. FIG. 5D shows another possible implementation where these annotations are made available for inspection in tooltips (or other layers or portions of display) of tables and table cells. These tooltips pop up when a user hovers their mouse cursor over a table cell. FIG. 5D shows a tooltip 532 for one of the table cells, displaying the annotated identifiers for the row and column that the cell occupies, and also another tooltip 534 for the entire table, showing a unique identifier for the table.
Cells that span multiple rows or columns can be recognized according to the input format (e.g., XML CALS format, HTML, XHTML). In some implementations, cells that span multiple rows or columns may be expanded out. In some implementations, cells that span multiple rows or columns may have multiple headers, and headers that span multiple rows or columns are shared by multiple cells. In some implementations, cells including header cells that span multiple rows or columns may have multiple indexes corresponding to the individual rows and columns that they span.
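One possible way to assign multiple indexes to spanning cells is sketched below. It assumes rows are supplied in source order as `(text, rowspan, colspan)` tuples, as they would be read from an HTML-like input, and it uses 1-based numeric identifiers; the function name `annotate_grid` and the sample cell texts are hypothetical.

```python
def annotate_grid(rows):
    """Annotate every cell with the row and column identifiers it spans.

    rows: list of rows, each a list of (text, rowspan, colspan) tuples.
    Returns a flat list of dicts with 1-based 'rows' and 'cols' lists.
    """
    occupied = set()  # (row, col) slots already claimed by some cell
    annotated = []
    for r, row in enumerate(rows, start=1):
        c = 1
        for text, rowspan, colspan in row:
            while (r, c) in occupied:  # skip slots filled by a cell from a row above
                c += 1
            row_ids = list(range(r, r + rowspan))
            col_ids = list(range(c, c + colspan))
            occupied.update((ri, ci) for ri in row_ids for ci in col_ids)
            annotated.append({"text": text, "rows": row_ids, "cols": col_ids})
            c += colspan
    return annotated


# A cell spanning two rows, followed by two ordinary cells in each row.
cells = annotate_grid([
    [("Company A", 2, 1), ("2015", 1, 1), ("2016", 1, 1)],
    [("2", 1, 1), ("3", 1, 1)],
])
```

In this sketch the spanning cell carries row identifiers 1 and 2, while the first cell of the second source row is placed in column 2 rather than column 1, which is the adjustment that simple counting would miss.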
At stage 110, process 100 can optionally merge two or more cells of a table. In some implementations, table structures may be corrected by merging cells and/or rows. Despite the structured format, often the initial structuring provided in a semi-structured document is appropriate for the reading of the text, but does not reflect the logical structure of the table. FIG. 2A shows an example table 200 in which the second row of column header text has been split into separate rows, 201 and 202. One of the chemical descriptions has also been split across multiple cells, 203 and 204. The first column of row headers contains empty cells, 205 and 206, which reflects that the cell above should span the row with the empty cell as well. FIG. 2B shows a table 250, which is the result of merging cells from table 200. There is now a single header cell, 251, with the text “Measured Component,” rather than two cells, 207 and 208, where this text was separated. The column header with the text “Experiment,” 252, is now correctly aligned with the last three columns of the table. The row header with the text “II,” 253, is now correctly aligned with the last two rows. The chemical description, 254, is now in a single cell.
Process 100 can merge cells based on one or more of the following: empty cells; rows with similar structure (e.g., where the same number of cells span the same columns); the distinguishing of header cells and data cells; the amount of text within a cell suggesting wrapping into the next cell; mismatched brackets in the text that would match if cells are merged; cell contents starting or ending with a conjunction or a preposition; or any combination thereof. For table 200 in FIG. 2A, stage 110 in FIG. 1A can merge every cell in row 210 with the cell above it (row 209) because these rows are rows of column headers and there is no empty cell in the bottom row (210), but there are many empty cells in the top row (209). Row 210 is merged into row 209 because column headers are typically vertically aligned to the bottom. In the case of rows 211 and 212, stage 110 can merge every cell in row 212 with the cell above it because these rows are not rows of column headers and there are multiple empty cells in row 212. Row 212 can be merged into row 211 because the text in data cells tends to be vertically aligned to the top and because the text in row 212 (“hexahydrate”) is lowercase, which suggests that it is not the start of a cell. In the case of the last cells of the first row (labeled 213, 214 and 215), stage 110 can prefer to merge them horizontally rather than vertically because this is the first row of column headers. Tables often contain main column headers spanning multiple columns, with sub-headers below.
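The text-based merge cues described above (mismatched brackets, a leading or trailing conjunction or preposition, and a lowercase start) could be sketched as follows. This is an illustrative sketch only: the word list is a small hypothetical sample, the function names are not from the disclosure, and the cues based on empty cells, row structure and header rows are not modeled here.

```python
# Small, illustrative word list; a fuller list would be needed in practice.
JOINING_WORDS = {"and", "or", "of", "in", "with", "for", "to", "by"}


def bracket_balance(text):
    """Number of opening round brackets minus closing round brackets."""
    return text.count("(") - text.count(")")


def should_merge(upper, lower):
    """Return True if `lower` looks like a continuation of `upper`."""
    upper, lower = upper.strip(), lower.strip()
    if not upper or not lower:
        return False  # empty cells are handled by other cues
    # Brackets left open in the upper piece are closed once the pieces are joined.
    if bracket_balance(upper) > 0 and bracket_balance(upper) + bracket_balance(lower) == 0:
        return True
    # The upper piece ends, or the lower piece starts, with a joining word.
    if upper.split()[-1].lower() in JOINING_WORDS or lower.split()[0].lower() in JOINING_WORDS:
        return True
    # A lowercase start suggests the lower piece is not the start of a cell.
    return lower[0].islower()
```

Under these assumptions a lowercase fragment such as “hexahydrate” would be treated as a continuation of the cell above it, whereas two capitalized fragments would be left for the structural cues to decide.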
At stage 114, process 100 can optionally index the document into a format optimized for large-scale querying, search and text mining. For example, an index can be created that allows fast searching of tabular information contained within millions of individual documents. Process 100 can manipulate the representation, which includes the annotations for the table cells, and these can be converted to different formats and optimized for different needs, where the annotations for the table cells are nevertheless preserved. One example of this is converting the representation into a format optimized for efficient search. In some embodiments, the annotation process is automatic and results in annotations represented in a digital format that is amenable to further automatic manipulation, namely of the kind needed to facilitate computer-based search. The identifiers for the rows and columns of a table enable a search engine to find cells that occupy the same row or column by comparing these identifiers.
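As one hypothetical illustration of how row and column identifiers could support search, the sketch below builds an in-memory lookup keyed by document, table, axis and identifier. The key layout and the function name `build_index` are assumptions; the disclosure does not prescribe an index format, and a production search index would look quite different.

```python
from collections import defaultdict


def build_index(doc_id, table_id, cells):
    """Map (doc, table, axis, identifier) to the texts of the cells found there.

    cells: dicts with 'text', 'rows' and 'cols', where 'rows' and 'cols' are
    lists of the row and column identifiers that the cell spans.
    """
    index = defaultdict(list)
    for cell in cells:
        for r in cell["rows"]:
            index[(doc_id, table_id, "row", r)].append(cell["text"])
        for c in cell["cols"]:
            index[(doc_id, table_id, "col", c)].append(cell["text"])
    return index
```

Cells that occupy the same row are then found with a single lookup on the shared row identifier, without re-parsing the table.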
At stage 116, process 100 can extract information from the table. In an HTML table, rows can be extracted directly, but the extracted rows will not always be correct. For example, FIG. 3 shows an example annotated table, 300, in which a cell, 301, spans multiple rows. In HTML (not shown in FIG. 3), a table corresponding to table 300 can typically be represented by including the cell inside the first row (e.g., corresponding to row 302) that it spans, and annotating the cell as spanning two rows. This means that the first row (e.g., corresponding to row 302) of the corresponding HTML table has three elements (e.g., corresponding to cells 301, 304 and 305), and the second row (e.g., corresponding to row 303) has only two elements (e.g., corresponding to cells 306 and 307). In the HTML representation, although a cell (e.g., corresponding to cell 301) could be the header for the second row (e.g., corresponding to row 303), it does not even appear within that row. Moreover, finding the appropriate column headers for elements in the second row in the HTML representation can be particularly challenging, because counting alone will not be enough: the count may need to be adjusted for any cells that span multiple rows or columns.
In comparison, table 300 of FIG. 3 includes the column and row identifier annotations using the index term approach discussed above in relation to stage 108. In the index term approach, each cell can contain an annotation with respect to both the row and column to which the cell belongs. The content of a cell can be linked to its headers by finding matching index values. For example, assuming that cell 301 is a header cell, cell 307 would adopt cell 301 as a row header because they both have a row index of 2. Cells in the same row share the same row identifiers, so cell 307 becomes a member of rows 1 and 2. Similarly, cells in the same column can have the same column identifiers, so any cell can be associated with other cells in that column (including the header cells for the column). For example, cell 307 shares the same column as cell 305 via having a column index of 3.
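The matching of index values described above can be sketched as follows. The cell layout is an assumption reconstructed from the description of table 300 (cell 301 spanning rows 1 and 2 in column 1, cells 304 and 305 in row 1, cells 306 and 307 in row 2); the field names, the `header` marker and the function names are hypothetical.

```python
def shares(a, b, axis):
    """True if cells a and b have at least one identifier in common on `axis`."""
    return bool(set(a[axis]) & set(b[axis]))


def link_headers(cell, cells):
    """Find the header cells linked to `cell` by matching row/column identifiers."""
    return {
        "row_headers": [h["id"] for h in cells
                        if h is not cell and h["header"] == "row" and shares(h, cell, "rows")],
        "column_headers": [h["id"] for h in cells
                           if h is not cell and h["header"] == "column" and shares(h, cell, "cols")],
    }


def same_column(cell, cells):
    """Identifiers of the other cells that share a column with `cell`."""
    return [c["id"] for c in cells if c is not cell and shares(c, cell, "cols")]


# Assumed layout; cell 301 is taken to be a row header, as in the example above.
table_300 = [
    {"id": 301, "rows": [1, 2], "cols": [1], "header": "row"},
    {"id": 304, "rows": [1], "cols": [2], "header": None},
    {"id": 305, "rows": [1], "cols": [3], "header": None},
    {"id": 306, "rows": [2], "cols": [2], "header": None},
    {"id": 307, "rows": [2], "cols": [3], "header": None},
]
cell_307 = table_300[4]
links = link_headers(cell_307, table_300)
```

With this layout, cell 307 is linked to cell 301 through the shared row index of 2 and is associated with cell 305 through the shared column index of 3.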
In some implementations in which the row and column headers are directly copied into each cell, the extraction stage 116 can output the column header for the cell (readily available in the annotations) along with the cell contents (also readily available in the representation for the cell).
To extract particular relationships, constraints can be imposed on the content of the headers and the cell. This can be based on the type of the content, such as number, date, company, chemical description or disease. It can be based on a particular kind of disease such as “neoplasm” using an ontology, or a particular range (e.g., 1 to 100). It can also be based on pattern matching of the content using regular expressions or linguistic patterns.
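A constraint-based extraction of this kind could be sketched as follows, assuming the direct encoding in which each cell carries the text of its headers. The year pattern, the default range of 1 to 100, the field names and the function name `extract_yearly_values` are illustrative assumptions; an ontology-based or linguistic constraint would replace the regular expression.

```python
import re

YEAR = re.compile(r"^(19|20)\d{2}$")  # assumed pattern for a year-like column header


def extract_yearly_values(cells, low=1, high=100):
    """Return (row header, year, value) triples that satisfy the constraints.

    cells: dicts with 'text', 'row_headers' and 'column_headers'.
    """
    results = []
    for cell in cells:
        try:
            value = float(cell["text"].replace(",", ""))  # type constraint: number
        except ValueError:
            continue
        if not low <= value <= high:                      # range constraint
            continue
        for year in (h for h in cell["column_headers"] if YEAR.match(h)):
            for entity in cell["row_headers"]:
                results.append((entity, year, value))
    return results
```

Applied to the earlier example of a row header of “Company A”, a column header of “2015” and a cell value of “2”, this sketch would yield a single triple linking the company, the year and the value.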
Stage 102 can include identifying one or more tables in unstructured text (e.g., which lines of text contain tables) and establishing the table structure (e.g., determining row and column boundaries).
Process 120 can identify the tables by performing one or more of: identifying lines within the text; identifying multiple rows where text or white space is aligned; identifying table captions or headers; or any combination thereof.
Once identified, process 120 can establish table structures by one or more of the following: establishing the column boundaries based on the alignment of white space across rows; recognizing the columns that a header spans based on the alignment of the header with respect to the columns below; establishing cell contents according to alignment of contents and white space; or any combination thereof.
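Establishing column boundaries from the alignment of white space across rows could be sketched as follows. The sketch assumes fixed-width plain text with space characters only (no tabs) and treats a run of at least `min_gap` positions that are blank in every line as a column separator; the `min_gap` parameter and the function names are assumptions added for illustration.

```python
def column_boundaries(lines, min_gap=2):
    """Return (start, end) character spans of the columns, end exclusive."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    # A position is blank if every line has a space there.
    blank = [all(line[i] == " " for line in padded) for i in range(width)]
    # Collect runs of blank positions wide enough to separate two columns.
    separators, i = [], 0
    while i < width:
        if blank[i]:
            j = i
            while j < width and blank[j]:
                j += 1
            if j - i >= min_gap:
                separators.append((i, j))
            i = j
        else:
            i += 1
    # Columns are the stretches of text between consecutive separators.
    spans, start = [], 0
    for sep_start, sep_end in separators:
        if sep_start > start:
            spans.append((start, sep_start))
        start = sep_end
    if start < width:
        spans.append((start, width))
    return spans


def split_cells(line, spans):
    """Cut one line of text into cell contents using the column spans."""
    return [line[start:end].strip() for start, end in spans]
```

Requiring a minimum gap width keeps a single space inside a cell (for example, between the two words of a row header) from being mistaken for a column boundary.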
Stage 107 can include identifying one or more tables in unstructured text (e.g., which lines of text contain tables) and establishing some of the table structure (e.g., determining row and column boundaries).
Process 130 can identify the tables by performing one or more of: identifying lines within the text; identifying multiple rows in which text or white space is aligned; identifying table captions or headers; or any combination thereof.
Stage 109 can establish table structures by one or more of the following: establishing the number of columns based on the differences in the amount of white space between one column of text and the next; recognizing the columns that a header spans based on the alignment of the header vs. columns below; establishing cell contents according to alignment of contents and white space; or any combination thereof. Each cell can be classified as either a header cell or a data cell. Header cells are recognized based on one or more of the following: explicit coding of header cells in the input; formatting differences between header cells and other cells; the presence of at least one header cell for every column. Header cells can be further classified as being a column header cell, a row header cell or possibly both, based on their position in the table.
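Recognizing the columns that a header spans from its alignment with the columns below could be sketched as an overlap test on character positions. The sketch assumes that column spans and the header span are given as `(start, end)` character offsets with the end exclusive; the function name `columns_spanned` is hypothetical.

```python
def columns_spanned(header_span, column_spans):
    """Return the 1-based indexes of the columns that a header overlaps.

    header_span: (start, end) character offsets of the header text.
    column_spans: list of (start, end) offsets, one per column, left to right.
    """
    h_start, h_end = header_span
    return [
        index
        for index, (c_start, c_end) in enumerate(column_spans, start=1)
        if h_start < c_end and c_start < h_end  # the two intervals overlap
    ]
```

A header whose text stretches over the second and third columns would thus receive both column indexes, in the same way as a spanning cell in a semi-structured table.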
Each cell can be linked to its respective column and row headers. A cell can be linked to multiple column or row headers, for example, when the table has individual headers for each column, and then other headers spanning multiple columns. In some implementations, column and row headers can be encoded directly by annotating each cell with the text of the column header cells in its column and the text of the row header cells in its row. In some implementations, the relation between a cell and its respective row headers and column headers can be encoded indirectly. Each cell can be annotated with one or more identifiers for the rows that it spans and one or more identifiers for the columns that it spans.
These annotations, represented in FIGS. 1A-1D as Annotated representation of tables 112, can use any unique identifier, for instance, numeric identifiers (indexes) that reflect the position of that row or column within the table. These embodiments allow identification of any two cells in the same row or column, even if neither is a header cell. In some implementations, at least some of the column and row headers can be encoded directly and at least some can be encoded indirectly.
In some implementations, cells that span multiple rows or columns may be expanded out. In some implementations, cells that span multiple rows or columns may have multiple headers, and headers that span multiple rows or columns are shared by multiple cells. In some implementations, cells including header cells that span multiple rows or columns may have multiple indexes corresponding to the individual rows and columns they span.
Those skilled in the art will appreciate that the components illustrated in each of the flow diagrams discussed above may be altered in a variety of ways. For example, the order of the logic may be rearranged, sub-steps may be performed in parallel, illustrated logic may be omitted, other logic may be included, etc.
Several implementations of the disclosed technology are described above in reference to the figures. The computing devices on which the described technology may be implemented can include one or more central processing units, memory, input devices (e.g., keyboard and pointing devices), output devices (e.g., display devices), storage devices (e.g., disk drives) and network devices (e.g., network interfaces). The memory and storage devices are computer-readable storage media that can store instructions that implement at least portions of the described technology. In addition, the data structures and message structures can be stored or transmitted via a data transmission medium, such as a signal on a communications link. Various communications links can be used, such as the Internet, a local area network, a wide area network, or a point-to-point dial-up connection. Thus, computer-readable media can comprise computer-readable storage media (e.g., “non-transitory” media) and computer-readable transmission media.
As used herein, the word “or” refers to any possible permutation of a set of items. For example, the phrase “A, B, or C” refers to at least one of A, B, C, or any combination thereof, such as any of the following: A; B; C; A and B; A and C; B and C; A, B, and C; or multiples of any item such as A and A; B, B, and C; A, A, B, C, and C; etc.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Specific embodiments and implementations have been described herein for purposes of illustration, but various modifications can be made without deviating from the scope of the embodiments and implementations. The specific features and acts described above are disclosed as example forms of implementing the claims that follow. Accordingly, the embodiments and implementations are not limited except as by the appended claims.
Any patents, patent applications, and other references noted above are incorporated herein by reference. Aspects can be modified, if necessary, to employ the systems, functions, and concepts of the various references described above to provide yet further implementations. If statements or subject matter in a document incorporated by reference conflicts with statements or subject matter of this application, then this application shall control.
Claims (22)
1. A computing device implemented method of extracting information from heterogeneous tables in semi-structured text and unstructured text, the method comprising steps of:
identifying, by a computing device, target content from a table in an electronic document, wherein the target content is presented in a plurality of cells table cell context within a document;
classifying, by the computing device, each table cell as the plurality of cells into one or more of a header cells and a plurality of cell or data cells cell based on at least one of explicit coding of the plurality of cells, formatting of the plurality of cells, relationship between the one or more header cells and columns in the table, presence of horizontal lines in the table, type of the target content in the plurality of cells, presence of measurement units within brackets in the table, and presence of words referring to mathematical operations on values in a table its context or content;
annotatingdirectly encoding, automatically by the computing device, the plurality of data cells cell with annotations to indicate their positions the data cell's position in the a table and an association between each of the plurality of data cells cell and the one or more header cells cell to enable extraction of the target content from the table; and
indexing, by the computing device, the electronic document utilizing the association between the plurality of data cells cell and the one or more header cells for responding to search queries cell.
2. The computing device implemented method of claim 1 , wherein the target content corresponds to semi-structured text that does not explicitly provide relationships between the plurality of cells and the one or more headers cellsnumeric identifiers are used to identify a position of each cell in the table within the document.
3. The computing device implemented method of claim 2 1 , wherein the electronic document is selected from one of HTML and XML documents and includes format tagsfurther comprising:
merging two or more cells to correct one or more table structures in the document.
4. The computing device implemented method of claim 1 , wherein the target content corresponds to plain text and the step of identifying the target content identifies at least one of lines within text, multiple rows where text or white space is aligned, and table captions or headers.
5. The computing device implemented method of claim 1 , further comprising a step of converting the target content into semi-structured text
wherein the header cell is also identified by formatting differences between the header cell and other cells in the document.
6. The computing device implemented method of claim 4 1 , wherein the step of converting comprisesfurther comprising steps of:
establishing a number of columns based on differences in an amount of white space between two columns of text; and
recognizing columns that the table captions or header spans, or establishing cell contents according to alignment of contents and white space.
7. The computing device implemented method of claim 1 , further comprising a step of classifying the one or more of header cells into one or more column header cells and one or more row header cells based at least partially on a position of the one or more header cells in the table:
indexing different formats of the document in an identical way to allow for more searches to be performed in the document.
8. The computing device implemented method of claim 7 1 , wherein the step of annotating associates each of the plurality of data cell with the one or more column header cells or the one or more row header cells cell is also identified by one or more words referring to operations.
9. A non-transitory computer-readable medium storing computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform a method of extracting information from heterogeneous tables in semi-structured text and unstructured text, the method comprising steps ofA computer program product comprising a tangible storage medium encoded with processor-readable instructions that, when executed by one or more processors, enable the computer program product to:
identifying, by a computing device, target content from a table in an electronic document, wherein the target content is presented in a plurality of cellsidentify table cell context within a document;
classifying, by the computing device, the plurality of cells into one or more of header cells and a plurality of data cellsclassify each table cell as a header cell or data cell based on its context or content;
annotatingdirectly encode, automatically by the computing deviceone or more processors, the plurality of data cells cell with annotations to indicate their positions the data cell's position in the a table, and an association between each of the plurality of data cells cell and the one or more header cells cell to enable extraction of the target content from the table; and
extractingextract, by the computing devicecomputer program product, the target content from the table utilizing the association between plurality ofthe data cellscell and one or more header cells for the target content extraction requests cell.
10. The non-transitory computer-readable mediumcomputer program product of claim 9 , wherein the electronic document is selected from one of HTML and XML documents and includes at least one of semi-structured text and unstructured textheader cell is identified by one or more measurement units within the header cell.
11. The non-transitory computer-readable mediumcomputer program product of claim 9 , wherein the step of classifying is based on at least one of explicit coding of the plurality of cells, formatting of the plurality of cells, relationship between the one or more header cells and columns in the table, presence of horizontal lines in the table, type of the target content in the plurality of cells, presence of measurement units within brackets the table, and presence of words referring to mathematical operations on values in a table.
12. The non-transitory computer-readable mediumcomputer program product of claim 9 , wherein the step of annotating indicates at least one of target content and position of the header cellstwo or more of the cells within the table are merged.
13. The non-transitory computer-readable medium of claim 9 , wherein the step of annotating indicates target content of the one or more header cells and a position of the one or more header cells.
14. The non-transitory computer-readable mediumcomputer program product of claim 9 , further comprising a step of generating a representation of the table utilizing the indications of the one or more header cellswherein structures within the table are corrected by merging cells and/or rows.
15. The non-transitory computer-readable mediumcomputer program product of claim 9 , further comprising a step of identifying one or more of the plurality of cells that span multiple columns or rowswherein the document is optimized for text mining.
16. The non-transitory computer-readable mediumcomputer program product of claim 15, further comprising a step of expanding the identified one or more cells9, wherein a format of the document is optimized to increase efficiency during one or more searches.
17. A computer system of extracting information from heterogeneous tables in semi-structured text and unstructured textconnected to a network, the system comprising:
one or more processors configured to:
identify, target content from a table in an electronic document, wherein the target content is presented in a plurality of cells table cell context within a document;
classify the plurality of cells into one or more row- or column-defining cells and a plurality of data cells table cell context as a header cell or data cell based on its context or content;
automatically annotate directly encode the plurality of data cell with annotations to indicate their positions the data cell's position in the a table and an association between each of the plurality of the data cells cell and the one or more row- or column-defining cells header cell to enable extraction of the target content from the table; and
generate a representation of a the table based at least partially on the association of each of between the plurality of data cells cell with one or more row- or column-defining defining cells and the header cell.
18. The computer system of claim 17 , wherein the one or more processors are further configured to classify the one or more row- or column-defining cells into a subset of column header cells and a subset of row header cells based at least partially on a position of the one or more row- or column-defining cells in the tablerows having a substantially similar structure are merged after the header cell and data cell have been annotated.
19. The computer system of claim 18 17 , wherein at least one of the one or more row- or column-defining cell is classified as both a column header cell and a row header cell the document is converted into different formats to increase efficiency in searching for information within the table.
20. The computer system of claim 18 17 , wherein the one or more processors are further configured to merge two or more of the plurality of data cells two or more other cells having unmatched brackets are merged within the table.
21. The system of claim 20 , wherein the two or more cells are merged based on at least one of the plurality of data cells within a proximity or an alignment of text being empty.
22. The computing device implemented method of claim 1 , wherein the annotations to indicate the data cell's position in the table comprise identifiers reflecting a row number and a column number of the table occupied by the data cell.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US17/859,132 USRE50675E1 (en) | 2016-05-16 | 2022-07-07 | Extracting information from tables embedded within documents |
Applications Claiming Priority (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201662337216P | 2016-05-16 | 2016-05-16 | |
| US15/594,762 US10706218B2 (en) | 2016-05-16 | 2017-05-15 | Extracting information from tables embedded within documents |
| US17/859,132 USRE50675E1 (en) | 2016-05-16 | 2022-07-07 | Extracting information from tables embedded within documents |
Related Parent Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US15/594,762 Reissue US10706218B2 (en) | 2016-05-16 | 2017-05-15 | Extracting information from tables embedded within documents |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| USRE50675E1 (en) | 2025-11-25 |
Family
ID=60297037
Family Applications (2)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US15/594,762 Ceased US10706218B2 (en) | 2016-05-16 | 2017-05-15 | Extracting information from tables embedded within documents |
| US17/859,132 Active 2037-10-31 USRE50675E1 (en) | 2016-05-16 | 2022-07-07 | Extracting information from tables embedded within documents |
Family Applications Before (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US15/594,762 Ceased US10706218B2 (en) | 2016-05-16 | 2017-05-15 | Extracting information from tables embedded within documents |
Country Status (1)
| Country | Link |
|---|---|
| US (2) | US10706218B2 (en) |
Families Citing this family (63)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2018066144A1 (en) * | 2016-10-07 | 2018-04-12 | 富士通株式会社 | Program for generation of indexed data, method for generation of indexed data, system for generation of indexed data, search program, search method, and search system |
| US11775814B1 (en) | 2019-07-31 | 2023-10-03 | Automation Anywhere, Inc. | Automated detection of controls in computer applications with region based detectors |
| CN108470021B (en) * | 2018-03-26 | 2022-06-03 | 阿博茨德(北京)科技有限公司 | Method and device for positioning table in PDF document |
| US10878195B2 (en) * | 2018-05-03 | 2020-12-29 | Microsoft Technology Licensing, Llc | Automated extraction of unstructured tables and semantic information from arbitrary documents |
| US11693923B1 (en) | 2018-05-13 | 2023-07-04 | Automation Anywhere, Inc. | Robotic process automation system with hybrid workflows |
| US10831798B2 (en) * | 2018-09-20 | 2020-11-10 | International Business Machines Corporation | System for extracting header labels for header cells in tables having complex header structures |
| US10776573B2 (en) | 2018-09-20 | 2020-09-15 | International Business Machines Corporation | System for associating data cells with headers in tables having complex header structures |
| US11514258B2 (en) | 2018-09-20 | 2022-11-29 | International Business Machines Corporation | Table header detection using global machine learning features from orthogonal rows and columns |
| US11443106B2 (en) * | 2018-09-20 | 2022-09-13 | International Business Machines Corporation | Intelligent normalization and de-normalization of tables for multiple processing scenarios |
| US11762890B2 (en) * | 2018-09-28 | 2023-09-19 | International Business Machines Corporation | Framework for analyzing table data by question answering systems |
| CN109284495B (en) * | 2018-11-03 | 2023-02-07 | 上海犀语科技有限公司 | Method and device for performing table-free line table cutting on text |
| US11062704B1 (en) | 2018-12-21 | 2021-07-13 | Cerner Innovation, Inc. | Processing multi-party conversations |
| US11410650B1 (en) | 2018-12-26 | 2022-08-09 | Cerner Innovation, Inc. | Semantically augmented clinical speech processing |
| JP6758448B1 (en) * | 2019-04-15 | 2020-09-23 | 株式会社フィエルテ | Document analysis device, document analysis method and document analysis program |
| US11113095B2 (en) | 2019-04-30 | 2021-09-07 | Automation Anywhere, Inc. | Robotic process automation system with separate platform, bot and command class loaders |
| US11614731B2 (en) | 2019-04-30 | 2023-03-28 | Automation Anywhere, Inc. | Zero footprint robotic process automation system |
| US11243803B2 (en) | 2019-04-30 | 2022-02-08 | Automation Anywhere, Inc. | Platform agnostic robotic process automation |
| US11301224B1 (en) | 2019-04-30 | 2022-04-12 | Automation Anywhere, Inc. | Robotic process automation system with a command action logic independent execution environment |
| JP2021009591A (en) * | 2019-07-02 | 2021-01-28 | 株式会社日立製作所 | Data acquisition device, data acquisition method, and data acquisition program |
| CN112329452B (en) * | 2019-08-05 | 2024-11-26 | 珠海金山办公软件有限公司 | A method, device, computer storage medium and terminal for generating a chart |
| US11270065B2 (en) | 2019-09-09 | 2022-03-08 | International Business Machines Corporation | Extracting attributes from embedded table structures |
| US11380116B2 (en) * | 2019-10-22 | 2022-07-05 | International Business Machines Corporation | Automatic delineation and extraction of tabular data using machine learning |
| US11003847B1 (en) * | 2019-11-05 | 2021-05-11 | Sap Se | Smart dynamic column sizing |
| CN111062259B (en) * | 2019-11-25 | 2023-08-25 | 泰康保险集团股份有限公司 | Table identification method and apparatus |
| US11481304B1 (en) | 2019-12-22 | 2022-10-25 | Automation Anywhere, Inc. | User action generated process discovery |
| US11348353B2 (en) | 2020-01-31 | 2022-05-31 | Automation Anywhere, Inc. | Document spatial layout feature extraction to simplify template classification |
| US11514154B1 (en) | 2020-01-31 | 2022-11-29 | Automation Anywhere, Inc. | Automation of workloads involving applications employing multi-factor authentication |
| US11244203B2 (en) * | 2020-02-07 | 2022-02-08 | International Business Machines Corporation | Automated generation of structured training data from unstructured documents |
| US11182178B1 (en) | 2020-02-21 | 2021-11-23 | Automation Anywhere, Inc. | Detection of user interface controls via invariance guided sub-control learning |
| JP7468004B2 (en) * | 2020-03-11 | 2024-04-16 | 富士フイルムビジネスイノベーション株式会社 | Document processing device and program |
| CN111462327B (en) * | 2020-03-12 | 2022-12-13 | 成都飞机工业(集团)有限责任公司 | Unstructured data analysis method for three-dimensional inspection model of three-dimensional modeling software |
| US11782928B2 (en) * | 2020-06-30 | 2023-10-10 | Microsoft Technology Licensing, Llc | Computerized information extraction from tables |
| CN111626030A (en) * | 2020-07-28 | 2020-09-04 | 浙江明度智控科技有限公司 | Table differentiation content analysis method, system and storage medium for pharmaceutical industry |
| US12111646B2 (en) | 2020-08-03 | 2024-10-08 | Automation Anywhere, Inc. | Robotic process automation with resilient playback of recordings |
| US12423118B2 (en) | 2020-08-03 | 2025-09-23 | Automation Anywhere, Inc. | Robotic process automation using enhanced object detection to provide resilient playback capabilities |
| CN111913993B (en) * | 2020-08-12 | 2024-02-23 | 望海康信(北京)科技股份公司 | Table data generation method, apparatus, electronic device and computer readable storage medium |
| US12573227B2 (en) | 2020-10-05 | 2026-03-10 | Automation Anywhere, Inc. | Method and system for extraction of data from documents for robotic process automation |
| US11734061B2 (en) | 2020-11-12 | 2023-08-22 | Automation Anywhere, Inc. | Automated software robot creation for robotic process automation |
| CN112232048B (en) * | 2020-11-12 | 2024-08-20 | 腾讯科技(深圳)有限公司 | Form processing method based on neural network and related device |
| US11727215B2 (en) * | 2020-11-16 | 2023-08-15 | SparkCognition, Inc. | Searchable data structure for electronic documents |
| CN112328853A (en) * | 2020-11-26 | 2021-02-05 | 北京字跳网络技术有限公司 | Document information processing method, device and electronic device |
| US11734445B2 (en) | 2020-12-02 | 2023-08-22 | International Business Machines Corporation | Document access control based on document component layouts |
| US11599711B2 (en) | 2020-12-03 | 2023-03-07 | International Business Machines Corporation | Automatic delineation and extraction of tabular data in portable document format using graph neural networks |
| US11782734B2 (en) | 2020-12-22 | 2023-10-10 | Automation Anywhere, Inc. | Method and system for text extraction from an application window for robotic process automation |
| KR102815218B1 (en) | 2021-03-26 | 2025-06-02 | 한국전자통신연구원 | Method and apparatus for recognizing spo tuple relationship based on deep learning |
| CN113254627B (en) * | 2021-04-16 | 2023-07-25 | 国网河北省电力有限公司经济技术研究院 | Data reading method, device and terminal |
| CN113656592B (en) * | 2021-07-22 | 2022-09-27 | 北京百度网讯科技有限公司 | Data processing method and device based on knowledge graph, electronic equipment and medium |
| US11968182B2 (en) | 2021-07-29 | 2024-04-23 | Automation Anywhere, Inc. | Authentication of software robots with gateway proxy for access to cloud-based services |
| US11820020B2 (en) | 2021-07-29 | 2023-11-21 | Automation Anywhere, Inc. | Robotic process automation supporting hierarchical representation of recordings |
| US12097622B2 (en) | 2021-07-29 | 2024-09-24 | Automation Anywhere, Inc. | Repeating pattern detection within usage recordings of robotic process automation to facilitate representation thereof |
| CN113821691A (en) * | 2021-08-13 | 2021-12-21 | 安徽希施玛数据科技有限公司 | Document processing method and device, electronic equipment and readable storage medium |
| CN113869014A (en) * | 2021-08-25 | 2021-12-31 | 盐城金堤科技有限公司 | Extraction method and device of table data, storage medium and electronic equipment |
| US12197927B2 (en) | 2021-11-29 | 2025-01-14 | Automation Anywhere, Inc. | Dynamic fingerprints for robotic process automation |
| CN114186543B (en) * | 2021-12-06 | 2024-12-13 | 明度智云(浙江)科技有限公司 | A content analysis and extraction method, system and storage medium for drug experimental documents |
| CN114218233A (en) * | 2022-02-22 | 2022-03-22 | 子长科技(北京)有限公司 | Annual newspaper processing method and device, electronic equipment and storage medium |
| CN114707472A (en) * | 2022-03-11 | 2022-07-05 | 北京字跳网络技术有限公司 | Method and device for field merging and electronic equipment |
| US12536826B2 (en) | 2022-06-23 | 2026-01-27 | Automation Anywhere, Inc. | Computerized recognition of tabular data from an image |
| US12548360B2 (en) | 2022-09-15 | 2026-02-10 | Nielsen Consumer Llc | Methods, systems, articles of manufacture, and apparatus to tag segments in a document |
| US12602947B2 (en) | 2022-10-18 | 2026-04-14 | Automation Anywhere Inc. | Method and system for extracting data from documents and automatically modifying data item of the extracted data based on guidance retrieved from feedback file |
| CN115796137A (en) * | 2022-11-17 | 2023-03-14 | 华能招标有限公司 | Information extraction method and system for form data in document |
| CN115859926B (en) * | 2023-01-30 | 2023-05-16 | 天津联想协同科技有限公司 | Electronic form data relationship processing method and device, electronic equipment and medium |
| US11837004B1 (en) * | 2023-02-24 | 2023-12-05 | Oracle Financial Services Software Limited | Searchable table extraction |
| CN118839678B (en) * | 2024-09-20 | 2025-01-24 | 杭州恒生聚源信息技术有限公司 | Document information recall method, device, electronic device and storage medium |
- 2017-05-15: US15/594,762, patent US10706218B2 (status: Ceased)
- 2022-07-07: US17/859,132, patent USRE50675E1 (status: Active)
Patent Citations (21)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5848186A (en) * | 1995-08-11 | 1998-12-08 | Canon Kabushiki Kaisha | Feature extraction system for identifying text within a table image |
| US6006240A (en) * | 1997-03-31 | 1999-12-21 | Xerox Corporation | Cell identification in table analysis |
| US20030188258A1 (en) * | 2002-03-28 | 2003-10-02 | International Business Machines Corporation | System and method in an electronic spreadsheet for displaying and/or hiding range of cells |
| US20040237029A1 (en) * | 2003-05-22 | 2004-11-25 | Medicke John A. | Methods, systems and computer program products for incorporating spreadsheet formulas of multi-dimensional cube data into a multi-dimentional cube |
| US20060069696A1 (en) * | 2004-09-30 | 2006-03-30 | Microsoft Corporation | Method and implementation for referencing of dynamic data within spreadsheet formulas |
| US20060080596A1 (en) * | 2004-10-07 | 2006-04-13 | International Business Machines Corporation | Dynamic update of changing data in user application via mapping to broker topic |
| US11461077B2 (en) * | 2004-11-26 | 2022-10-04 | Philip K. Chin | Method of displaying data in a table with fixed header |
| US20160334954A1 (en) * | 2006-12-28 | 2016-11-17 | Apple Inc. | Smart tables |
| US20090144313A1 (en) * | 2007-12-04 | 2009-06-04 | Cognos Incorporated | Data entry commentary and sheet reconstruction for multidimensional enterprise system |
| US8972437B2 (en) * | 2009-12-23 | 2015-03-03 | Apple Inc. | Auto-population of a table |
| US9449031B2 (en) * | 2013-02-28 | 2016-09-20 | Ricoh Company, Ltd. | Sorting and filtering a table with image data and symbolic data in a single cell |
| US20140369602A1 (en) * | 2013-06-14 | 2014-12-18 | Lexmark International Technology S.A. | Methods for Automatic Structured Extraction of Data in OCR Documents Having Tabular Data |
| US20150104077A1 (en) * | 2013-10-15 | 2015-04-16 | Samsung Electronics Co., Ltd. | Image processing apparatus and control method thereof |
| US20150142418A1 (en) * | 2013-11-18 | 2015-05-21 | International Business Machines Corporation | Error Correction in Tables Using a Question and Answer System |
| US20150363382A1 (en) * | 2014-06-13 | 2015-12-17 | International Business Machines Corporation | Generating language sections from tabular data |
| US20160055376A1 (en) * | 2014-06-21 | 2016-02-25 | iQG DBA iQGATEWAY LLC | Method and system for identification and extraction of data from structured documents |
| US20160078102A1 (en) * | 2014-09-12 | 2016-03-17 | Nuance Communications, Inc. | Text indexing and passage retrieval |
| US20160104077A1 (en) * | 2014-10-10 | 2016-04-14 | The Trustees Of Columbia University In The City Of New York | System and Method for Extracting Table Data from Text Documents Using Machine Learning |
| US20160103819A1 (en) * | 2014-10-10 | 2016-04-14 | Apple Inc. | Updating formulas in response to table transposition |
| US20160253982A1 (en) * | 2015-02-28 | 2016-09-01 | Microsoft Technology Licensing, Llc | Contextual zoom |
| US20170255628A1 (en) * | 2016-03-07 | 2017-09-07 | International Business Machines Corporation | Evaluating quality of annotation |
Also Published As
| Publication number | Publication date |
|---|---|
| US20170329749A1 (en) | 2017-11-16 |
| US10706218B2 (en) | 2020-07-07 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| USRE50675E1 (en) | Extracting information from tables embedded within documents | |
| CA2823396C (en) | Storage of a document using multiple representations | |
| JP5144940B2 (en) | Improved robustness in table of contents extraction | |
| US10698937B2 (en) | Split mapping for dynamic rendering and maintaining consistency of data processed by applications | |
| US20170147566A1 (en) | Converting data into natural language form | |
| AU2012207560A1 (en) | Storage of a document using multiple representations | |
| US20170357625A1 (en) | Event extraction from documents | |
| US20170161255A1 (en) | Extracting entities from natural language texts | |
| Kovačević et al. | Automatic extraction of metadata from scientific publications for CRIS systems | |
| Eberius et al. | DeExcelerator: a framework for extracting relational data from partially structured documents | |
| CN110770735A (en) | Transcoding of documents with embedded mathematical expressions | |
| US20120221324A1 (en) | Document Processing Apparatus | |
| CA2884242C (en) | Automated composition evaluator | |
| US10896227B2 (en) | Data processing system, data processing method, and data structure | |
| Biswas et al. | Amazon Textract’s new Layout feature introduces efficiencies in general purpose and generative AI document processing tasks. Amazon Web Service (AWS) Blogs | |
| Luo et al. | Biotable: A tool to extract semantic structure of table in biology literature | |
| Wong et al. | Updating the ICE annotation system: Tagging, parsing and validation | |
| JPWO2006046665A1 (en) | Document processing apparatus and document processing method | |
| JP2023523761A (en) | pharmaceutical process | |
| CN121434323B (en) | Data error correction methods, data error correction devices and storage media | |
| Algahtani | Arabic named entity recognition: A corpus-based study | |
| US20240419643A1 (en) | Computer-implemented method for deduplication of equivalent data objects in a set of data objects, computer program product, and web-hosted software product | |
| Guo | Research on logical structure annotation in English streaming document based on deep learning | |
| Wang | Research on information extraction based on web table structure and ontology | |
| CN121435964A (en) | Multi-error aggregation text correction methods, devices, storage media, and program products |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | FEPP | Fee payment procedure | Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |