WO2022113378A1

WO2022113378A1 - Table combining program, table combining system, and table combining method

Info

Publication number: WO2022113378A1
Application number: PCT/JP2020/048664
Authority: WO
Inventors: ジテンドラガンビール; ジュリアンタブレ; セルジオバルブエナ; アレクサンダルアンゲロフ; テオドルイワノフ
Original assignee: 株式会社KPMG Ignition Tokyo
Priority date: 2020-11-26
Filing date: 2020-12-25
Publication date: 2022-06-02
Also published as: JPWO2022113378A1

Abstract

The table combining program according to one or more embodiments of the present invention causes one or more processors to execute a table detection function for detecting, from a digital document, a first table portion and a second table portion different from the first table portion; an assessment function for assessing whether the first table portion and the second table portion can be combined; and a function for combining the first table portion and the second table portion when it is assessed that they can be combined.

Description

Table join program, table join system, and table join method

The disclosure herein relates to a table join program, a table join system, and a table join method for processing table data extracted from a table contained in a digital document.

Digital documents often include tables. In particular, IR materials such as financial statements include many tables. In order to analyze a digital document, it is desirable to efficiently extract table data from the tables contained in the digital document. The table data includes, for example, text data expressed in a table format. Conventional techniques for extracting table data from a table included in a digital document are described in, for example, Patent Document 1 and Non-Patent Document 1 below.

Patent Document 1: Japanese Unexamined Patent Publication No. 2020-170445

As described in Patent Document 1, when a digital document is created in PDF (Portable Document Format), two methods for extracting table data from a table are known: a method of converting the PDF file into an image file and extracting table data based on that image file (hereinafter referred to as the "image-based method"), and a method of reading text data and formatting data, such as vertical and horizontal ruled lines, directly from the PDF file (hereinafter referred to as the "direct method").

In the image-based method, the table layout is recognized and the table data is extracted using the image file converted from the PDF file. Specifically, the table layout is recognized by analyzing image information such as the ruled lines of the table, and the table data (for example, the text contained in the table) is extracted by, for example, OCR (optical character recognition).

In the direct method, the text is read directly from the PDF file, and the layout representing the positional relationship between the texts is recognized.

Digital documents include various types of tables. For example, some tables have at least one of the rows or columns separated by whitespace. In a table in which rows or columns are separated by such blanks, it is difficult to recognize the table layout by an image-based method.

If a table is too long in the horizontal direction to fit within the width of one page, or too long in the vertical direction to be displayed on one page, the table may be divided into multiple table parts for display. Conventional table data extraction techniques recognize a table that is divided into two table portions in this way as two separate tables.

One of the objects of the invention disclosed herein is to solve or alleviate problems in conventional table data extraction techniques. One of the more specific objects of the invention disclosed herein is to provide a table join program capable of joining a plurality of table parts when one table is divided into a plurality of table parts in a digital document.

Other objectives of the invention disclosed herein will become apparent by reference to the entire specification. The invention disclosed herein may solve the problems identified from the description of the present specification in place of or in addition to the above-mentioned problems.

The table join program according to one or more embodiments of the present invention causes one or more processors to execute a table detection function for detecting, from a digital document, a first table portion and a second table portion different from the first table portion; a determination function for determining whether the first table portion and the second table portion can be combined; and a function for combining the first table portion and the second table portion when it is determined that they can be combined.

In one or more embodiments of the present invention, the first table portion is extracted from the first page, and the second table portion is extracted from the second page different from the first page.

The table join program according to one or more embodiments of the present invention causes the one or more processors to further execute a function of performing a calculation based on first numerical data included in the first table portion and second numerical data included in the second table portion.

The table join program according to one or more embodiments of the present invention causes the one or more processors to further execute a function of adding, to the digital document, annotation information indicating at least one of the result of the calculation based on the first numerical data and the second numerical data and the content of the calculation.

In one or more embodiments of the present invention, when first header information of the first table portion and second header information of the second table portion are the same, it is determined that the first table portion and the second table portion can be combined.

In one or more embodiments of the present invention, when no text exists between the first table portion and the second table portion in the digital document, or when only predefined text exists between the first table portion and the second table portion, it is determined that the first table portion and the second table portion can be combined.

In one or more embodiments of the invention, the digital document is an unstructured document.

The annotation addition program according to one or more embodiments of the present invention causes one or more processors to execute a function of extracting a numerical value from an unstructured document, a function of executing a calculation based on the extracted numerical value, and a function of adding, to the digital document, annotation information indicating at least one of the result of the calculation and the content of the calculation. The annotation addition method according to one or more embodiments of the present invention is performed by one or more computer processors executing computer-readable instructions. The annotation addition method includes a step of extracting a numerical value from an unstructured document, a step of executing a calculation based on the extracted numerical value, and a step of adding, to the digital document, annotation information indicating at least one of the result of the calculation and the content of the calculation.
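
The three steps of the annotation addition method (extract, calculate, annotate) can be illustrated with a minimal Python sketch. The names `Annotation`, `extract_numbers`, and `annotate_sum`, the regular expression, and the choice of a sum as the calculation are all assumptions made for illustration; the specification does not prescribe them.

```python
import re
from dataclasses import dataclass

# Assumed number format: digits with optional thousands separators and decimals.
NUMBER = re.compile(r"-?\d+(?:,\d{3})*(?:\.\d+)?")


@dataclass
class Annotation:
    content: str   # content of the calculation, e.g. "1,200 + 800"
    result: float  # result of the calculation


def extract_numbers(text):
    """Return (raw_text, value) pairs for every numeric literal found in text."""
    return [(m.group(), float(m.group().replace(",", "")))
            for m in NUMBER.finditer(text)]


def annotate_sum(text):
    """Extract the numbers, add them up, and describe the calculation."""
    found = extract_numbers(text)
    content = " + ".join(raw for raw, _ in found)
    return Annotation(content=content, result=sum(value for _, value in found))
```

In this sketch the annotation is returned as a plain object; how it is attached to the document (for example, as an additional object of a PDF page) is left open.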

The table partitioning program according to one or more embodiments of the present invention causes one or more processors to execute a function of analyzing the pattern of cells constituting a table contained in a digital document and, when a repeating pattern appears in any column or row of the table, dividing the table into sub-tables, one for each repetition. The table partitioning method according to one or more embodiments of the present invention is performed by one or more computer processors executing computer-readable instructions. The table partitioning method includes a step of analyzing the pattern of cells constituting a table included in a digital document, and a step of dividing the table into sub-tables, one for each repetition, when a repeating pattern appears in any column or row of the table.
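
As a rough illustration of dividing a table at a repeating pattern, the following Python sketch looks for the smallest period with which the header row repeats and cuts the table into one sub-table per repetition. Treating "pattern" as equality of header labels, and representing a table as a list of rows whose first row is the header, are assumptions of this sketch rather than details taken from the specification.

```python
def smallest_period(labels):
    """Return the smallest p < len(labels) such that labels repeats every p items."""
    n = len(labels)
    for p in range(1, n):
        if n % p == 0 and all(labels[i] == labels[i % p] for i in range(n)):
            return p
    return None  # no repetition found


def split_by_repetition(table):
    """Split a table (list of rows, header first) into one sub-table per repetition."""
    period = smallest_period(table[0])
    if period is None:
        return [table]
    width = len(table[0])
    return [[row[start:start + period] for row in table]
            for start in range(0, width, period)]
```

A table whose header row reads "Item, Amount, Item, Amount" would be split into two two-column sub-tables, while a table with the header "A, B, C" is returned unchanged.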

The table join system according to one or more embodiments of the present invention comprises one or more processors. By executing computer-readable instructions, the one or more processors detect, from a digital document, a first table portion and a second table portion different from the first table portion, determine whether the first table portion and the second table portion can be combined, and combine the first table portion and the second table portion when it is determined that they can be combined.

The table division method according to one or more embodiments of the present invention includes a step of estimating whether each cell constituting a table included in a digital document corresponds to a header element or a table data element, and a step of dividing the table into two sub-tables at the boundary between a first data column and a second header column when the table includes, in this order, a first header column containing only cells corresponding to the header element, the first data column containing only cells corresponding to the table data element, the second header column containing only cells corresponding to the header element, and a second data column containing only cells corresponding to the table data element.
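
The column-based division can be sketched in Python as follows. The heuristic in `is_data_cell` (numeric-looking text is table data, anything else is a header element) is an assumption standing in for whatever estimator an implementation actually uses, and the sketch assumes the table is passed as body rows only, without a column header row.

```python
import re

# Assumed heuristic: amounts such as "1,000", "(500)", "-3.5" or "12%" are data.
NUMERIC = re.compile(r"^\(?-?\d[\d,]*(?:\.\d+)?\)?%?$")


def is_data_cell(cell):
    return bool(NUMERIC.match(cell.strip()))


def column_kind(column):
    """Classify a column as 'header', 'data', or 'mixed', ignoring empty cells."""
    kinds = {"data" if is_data_cell(c) else "header" for c in column if c.strip()}
    return kinds.pop() if len(kinds) == 1 else "mixed"


def split_at_second_header(table):
    """Split where a data column is followed by a second header column."""
    kinds = [column_kind(col) for col in zip(*table)]
    for i in range(1, len(kinds)):
        boundary = kinds[i - 1] == "data" and kinds[i] == "header"
        # Require a header column before the boundary and a data column after it.
        if boundary and "header" in kinds[:i] and "data" in kinds[i + 1:]:
            return [[row[:i] for row in table], [row[i:] for row in table]]
    return [table]
```

For a balance-sheet-like body with the columns "asset name, amount, liability name, amount", the split falls between the second and third columns; a table with one label column followed only by amount columns is left undivided.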

The table join method according to one or more embodiments of the present invention is performed by one or more computer processors executing computer-readable instructions. The table join method includes a step of detecting, from a digital document, a first table portion and a second table portion different from the first table portion; a step of determining whether the first table portion and the second table portion can be joined; and a step of joining the first table portion and the second table portion when it is determined that they can be joined.

The user device according to one or more embodiments of the present invention comprises one or more processors. By executing computer-readable instructions, the one or more processors upload a digital document to a server, acquire from the server an annotated document to which annotation information has been added, the annotation information indicating at least one of the result and the content of a calculation based on numerical values extracted from the digital document, and display the annotated document.

The table splitting program according to one or more embodiments of the present invention causes one or more processors to execute a function of estimating whether each cell constituting a table included in a digital document corresponds to a header element or a table data element, and a function of dividing the table into two sub-tables at the boundary between a first data column and a second header column when the table includes, in this order, a first header column containing only cells corresponding to the header element, the first data column containing only cells corresponding to the table data element, the second header column containing only cells corresponding to the header element, and a second data column containing only cells corresponding to the table data element.

The method for creating an electronically annotated unstructured document according to one or more embodiments of the present invention includes a step of acquiring an unstructured document; a step of detecting, from the unstructured document, a first table portion and a second table portion different from the first table portion; a step of determining whether the first table portion and the second table portion can be combined; a step of combining the first table portion and the second table portion when it is determined that they can be combined; a step of executing a calculation based on first numerical data included in the first table portion and second numerical data included in the second table portion; and a step of generating an annotated unstructured document by adding, to the unstructured document, annotation information indicating at least one of the result of the calculation based on the first numerical data and the second numerical data and the content of the calculation.

According to an embodiment of the present invention, it is possible to provide a table joining program capable of joining a plurality of table parts when one table is divided into a plurality of table parts in a digital document.

FIG. 1 is a block diagram of the table join system according to one embodiment of the present invention.
FIG. 2 is an explanatory diagram showing two tables included in a digital document processed by the table join system of FIG. 1.
FIG. 3 is an explanatory diagram showing two tables included in a digital document processed by the table join system of FIG. 1.
An explanatory diagram showing a table included in a digital document processed by the table join system of FIG. 1, and the annotation information added to the table.
An explanatory diagram showing part of a joined table produced by the table join system of FIG. 1, and the annotation information added to the joined table.
Another explanatory diagram showing part of a joined table produced by the table join system of FIG. 1, and the annotation information added to the joined table.
A diagram explaining an example of a table in which some of the ruled lines are represented by whitespace.
A diagram explaining another example of a table in which some of the ruled lines are represented by whitespace.
A diagram explaining another example of a table in which some of the ruled lines are represented by whitespace.
A diagram explaining another example of a table in which some of the ruled lines are represented by whitespace.
A flow diagram showing the table join process, which joins a table included in a digital document with another table.
A flow chart showing the process of adding annotation information.
An explanatory diagram showing an example of a table that can be divided in one embodiment of the present invention.
An explanatory diagram showing an example of a table that cannot be divided in one embodiment of the present invention.
A diagram showing part of a balance sheet, which is an example of a table that can be divided according to one embodiment of the present invention.

Hereinafter, various embodiments of the present invention will be described with reference to the drawings as appropriate. As shown in FIG. 1, the table join system 1 according to an embodiment of the present invention includes a user device 10 and a server 20. The table join system 1 may include a storage 30. The user device 10, the server 20, and the storage 30 are communicably connected to each other via a network 40. The network 40 may be a single network or may be configured by connecting a plurality of networks. The network 40 is, for example, the Internet, a mobile communication network, or a combination thereof. Any network that enables communication between electronic devices can be used as the network 40.

The table join system 1 shown in FIG. 1 is an example of a system to which the present invention can be applied, and the systems to which the present invention can be applied are not limited to the one shown in FIG. 1. The table join system 1 does not have to include all of the components shown. For example, the table join system 1 does not have to include the storage 30. The table join system 1 may also include components that are not shown. For example, although only one user device 10 is shown in FIG. 1 for the sake of brevity, the table join system 1 can include two or more user devices 10.

The illustrated table join system 1 includes the user device 10 and the server 20 connected to the network 40, but either the user device 10 or the server 20, each of which is a subcombination of the table join system 1, can also be understood as a table join system to which the invention described in the claims is applied. The table join system described in the claims may include the configurations and functions of both the user device 10 and the server 20 as constituent requirements, or may include the configuration or function of only one of the user device 10 and the server 20 as a constituent requirement without including that of the other. In other words, when the table join system described in the claims includes the configurations and functions of both the user device 10 and the server 20 as constituent requirements, the table join system 1 including the user device 10 and the server 20 corresponds to an embodiment of the invention relating to the table join system described in the claims. On the other hand, when the table join system described in the claims includes only the configuration and functions of the user device 10 as constituent requirements, the illustrated user device 10 corresponds to an embodiment of the invention relating to the table join system described in the claims. Likewise, when the table join system described in the claims includes only the configuration and functions of the server 20 as constituent requirements, the illustrated server 20 corresponds to an embodiment of the invention relating to the table join system described in the claims.

The table join system 1 identifies a table included in a digital document or a table portion that is a part of such a table. In a digital document, if a table is too wide to fit within the width of one page, or too long vertically to fit on one page, the table may be divided into a plurality of table parts for display. The table join system 1 can join the plurality of table parts into which one table has been divided. A digital document suitable for handling by the table join system 1 is, for example, an unstructured document having no structure definition. The unstructured document is, for example, a document in PDF format.

First, the configuration of the user device 10 will be described. The user device 10 is a personal computer (PC), a tablet terminal, a smartphone, or any other information processing device. The user device 10 includes a processor 11, a memory 12, a user interface 13, a communication interface 14, and a storage 15.

The processor 11 is an arithmetic unit that loads an operating system and various other programs from the storage 15 or other storage into the memory 12 and executes instructions included in the loaded program. The processor 11 is, for example, a CPU, an MPU, a DSP, a GPU, various arithmetic units other than these, or a combination thereof. The processor 11 may be realized by an integrated circuit such as an ASIC, PLD, FPGA, MCU or the like.

The memory 12 is used to store instructions executed by the processor 11 and various other data. The memory 12 is a main storage device (main memory) that the processor 11 can access at high speed. The memory 12 is composed of, for example, a RAM such as a DRAM or an SRAM.

The user interface 13 includes an input interface that accepts user input and an output interface that outputs various information under the control of the processor 11. The input interface is, for example, a keyboard, a pointing device such as a mouse, a touch panel, or any other information input device capable of accepting input from a user. The output interface is, for example, a liquid crystal display, a display panel, or any other information output device capable of outputting the calculation results of the processor 11.

The communication interface 14 is implemented as hardware, firmware, communication software such as a TCP/IP driver or PPP driver, or a combination thereof. The user device 10 can send and receive data to and from other devices such as the server 20 via the communication interface 14.

The storage 15 is an external storage device accessed by the processor 11. The storage 15 is, for example, a magnetic disk, an optical disk, a semiconductor memory, or any other storage device capable of storing data. The storage 15 may store the digital document 15a. The digital document 15a may be, for example, an unstructured document such as a PDF file.

Next, the function of the user device 10 will be described. The processor 11 of the user device 10 functions as an upload unit 11a and a display unit 11b. The upload unit 11a uploads the digital document to the server 20. For example, the upload unit 11a can read the digital document 15a from the storage 15 and upload the read digital document 15a to the server 20.

The display unit 11b displays a digital document on the display. The display unit 11b may display, for example, the digital document 15a read from the storage 15 on the display. The display unit 11b may receive a digital document from the server 20 and display the received digital document on the display. The display unit 11b may receive the annotated document 25c described later from the server 20 and display it on the display.

Next, the configuration of the server 20 will be described. The server 20 includes a processor 21, a memory 22, a user interface 23, a communication interface 24, and a storage 25.

The processor 21 is an arithmetic unit that loads various programs for providing an operating system and an application into the memory 22 and executes instructions included in the loaded programs. The description of the processor 11 also applies to the processor 21, and the description of the memory 12 also applies to the memory 22.

The user interface 23 includes an input interface that accepts the input of the operator of the server 20 and an output interface that outputs various information under the control of the processor 21.

The communication interface 24 is implemented as hardware, firmware, communication software such as a TCP/IP driver or PPP driver, or a combination thereof. The server 20 can send and receive data to and from other devices via the communication interface 24.

The storage 25 is an external storage device accessed by the processor 21. The storage 25 is, for example, a magnetic disk, an optical disk, a semiconductor memory, or any other storage device capable of storing data.

The storage 25 may store a table join program 25a for extracting table data from a table or table part included in a digital document and joining table parts based on the extracted table data; an original document 25b, which is the digital document to be analyzed; and an annotated document 25c, which includes annotations added by executing the table join program 25a. The instructions included in the table join program 25a may be executed by the processor 21. Details of the functions realized by executing the table join program 25a will be described later.

The original document 25b is, for example, a digital document uploaded from the user device 10. The original document 25b can include one or more tables. As described above, the original document 25b may be an unstructured document, such as a PDF file, that does not have a structure definition. An unstructured document includes objects such as text and images contained in each page constituting the document, together with coordinate information indicating the arrangement of the objects on the page, but does not include information indicating the structure of the document.

As will be described later, the annotated document 25c is a document in which annotation information indicating the content and result of calculation of numerical data included in the original document 25b is added to the original document 25b. The details of the annotation information will be described later. When the original document 25b is an unstructured document, the annotation information may be stored in the storage 25 as an object of the original document 25b.

In the table join system 1, there are no particular restrictions on where data is stored. For example, various data that can be stored in the storage 15 may be stored in a storage (for example, the storage 25 or the storage 30) or a database server that is physically separate from the user device 10. Similarly, various data that can be stored in the storage 25 may be stored in a storage (for example, the storage 15 or the storage 30) or a database server that is physically separate from the server 20. In FIG. 1, the storage 15 and the storage 25 are each shown as a single unit, but at least one of the storages 15 and 25 may be a collection of a plurality of physically separate storages. That is, in the present specification, the data stored in the storage 15 and the data stored in the storage 25 may be stored in a single storage or may be distributed across a plurality of storages. Also, in the present specification and claims, the term "storage" may refer to either a single storage or a collection of multiple storages, as the context permits.

Next, the function of the server 20 will be described. By executing the instructions included in the table join program 25a or other instructions, the processor 21 of the server 20 functions as a table detection unit 21a, a table data extraction unit 21b, a determination unit 21c, a connection unit 21d, a calculation unit 21e, and an annotation addition unit 21f.

The table detection unit 21a detects tables included in the digital document to be analyzed. For example, the table detection unit 21a can perform image processing such as rectangle detection on the digital document to be analyzed and detect rectangular elements included in the digital document as tables. The table detection unit 21a can also detect a table from a digital document by any known method other than rectangle detection. The digital document to be analyzed is, for example, the original document 25b stored in the storage 25. The original document 25b may include a table divided into a plurality of table portions. For example, in the original document 25b, a table that is longer in the horizontal direction than the width of the page may be divided into a plurality of table portions. Further, in the original document 25b, a table that is too long in the vertical direction to be displayed on one page may be divided into a plurality of table portions arranged across a plurality of pages. When one table is divided into a plurality of table portions, the table detection unit 21a detects each of the table portions as a separate table. In other words, the table detection unit 21a detects a table contained in the digital document without distinguishing whether it is a whole table or a part of a table. For example, when one table included in a digital document is divided into two parts, a first table portion and a second table portion, each of the first table portion and the second table portion is detected as a table. When the digital document contains a plurality of pages, the table detection unit 21a performs the table detection process on each of the plurality of pages. The term "table" as used herein is used in its general sense. That is, a "table" in the present specification is a representation in which data such as characters and numbers are described in cells separated by ruled lines. In a table, some of the ruled lines separating the cells may be omitted, and the cells may instead be separated by whitespace. A cell whose top, bottom, left, or right side is delimited by whitespace in this way may be called a whitespace-delimited cell or a blank-separated cell.

The table data extraction unit 21b recognizes the layout of each table detected by the table detection unit 21a, and also extracts table data arranged in a plurality of cells constituting the table. The table data is characters, numbers, or other data arranged in each cell. The table data extraction unit 21b may store the extracted table data in the storage 25 for each table. The table data of each table may be stored in the storage 25 in association with the table identification information that identifies each table.

In one embodiment, the table data extraction unit 21b can extract the table data of each table detected by the table detection unit 21a by an image-based method, a text-based method, or another known method. The text-based method is a method of converting a PDF file into a text file and extracting table data based on the text file. When extracting table data by the image-based method, the table data extraction unit 21b converts the original document 25b into an image file, and uses this image file to recognize the table layout and extract the table data. For example, the table data extraction unit 21b performs rectangle detection on the image file converted from the digital document, detects the rectangular graphic elements included in the table, and recognizes the detected rectangular graphic elements as cells. Further, the table data extraction unit 21b detects the coordinate information, width, and height of each detected cell, and recognizes cells belonging to the same column and the same row based on the coordinate information. The cells recognized in this way are surrounded by ruled lines. Further, the table data extraction unit 21b detects the table data arranged in each cell by, for example, OCR (optical character recognition).
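
The step of recognizing which cells share a row or column from their coordinates can be sketched as follows. The sketch assumes the rectangle detection and OCR have already produced a list of cells with a top-left corner, a size, and text; the `Cell` type, the `to_grid` function, and the 2-unit tolerance are illustrative assumptions, not values from the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    x: float       # left edge of the detected rectangle
    y: float       # top edge of the detected rectangle
    width: float
    height: float
    text: str = ""


def cluster(values, tol):
    """Group sorted coordinates; neighbours closer than tol share a group."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= tol:
            groups[-1].append(v)
        else:
            groups.append([v])
    return groups


def index_of(value, groups):
    """Return the index of the group whose range contains value."""
    for i, group in enumerate(groups):
        if group[0] <= value <= group[-1]:
            return i
    raise ValueError(f"{value} is not in any group")


def to_grid(cells, tol=2.0):
    """Arrange cells into rows and columns based on their top-left coordinates."""
    row_groups = cluster([c.y for c in cells], tol)
    col_groups = cluster([c.x for c in cells], tol)
    grid = [["" for _ in col_groups] for _ in row_groups]
    for c in cells:
        grid[index_of(c.y, row_groups)][index_of(c.x, col_groups)] = c.text
    return grid
```

The tolerance absorbs small detection jitter, so a cell detected at y = 10.5 still lands in the same row as one detected at y = 10. Merged cells that span several rows or columns are not handled by this sketch.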

The table data extraction unit 21b can determine whether a cell separated by ruled lines contains blank-separated cells. For example, if a cell separated by ruled lines contains text divided into multiple rows or columns, it can be determined that a blank separator exists between the rows and/or between the columns. In this case, the table data extraction unit 21b can recognize a plurality of blank-separated cells within the cell separated by ruled lines and can detect the table data contained in each blank-separated cell.
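
One simple way to find blank separators inside a ruled cell is to look for horizontal gaps between words that are wider than ordinary word spacing. The Python sketch below does this for the words on a single text line; the `Word` type, the `split_on_gaps` function, and the 10-unit gap threshold are assumptions for illustration, and the same idea applies to rows by using vertical coordinates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    x0: float  # left edge of the word
    x1: float  # right edge of the word
    text: str


def split_on_gaps(words, min_gap=10.0):
    """Split the words of one text line into blank-separated cells.

    A new cell starts wherever the horizontal gap to the text seen so far
    is at least min_gap; narrower gaps are treated as ordinary spaces.
    """
    groups = []
    right_edge = None
    for w in sorted(words, key=lambda word: word.x0):
        if right_edge is None or w.x0 - right_edge >= min_gap:
            groups.append([w.text])
        else:
            groups[-1].append(w.text)
        right_edge = w.x1 if right_edge is None else max(right_edge, w.x1)
    return [" ".join(group) for group in groups]
```

A suitable threshold depends on font size and page resolution, so in practice it would likely be derived from the document rather than fixed.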

When extracting table data by the text-based method, the table data extraction unit 21b converts the document data into a text file in which each piece of text is given its coordinates on the page. The layout of the table is then recognized by analyzing the positional relationship between the pieces of text based on this coordinate information.

The table data extraction unit 21b can perform both an image-based method for extracting table data and a text-based method for extracting table data. The image-based method of extracting table data and the text-based method of extracting table data may be executed in parallel, or one of them may be executed sequentially before the other.

The determination unit 21c determines whether each of the tables detected by the table detection unit 21a can be combined with another table. When a table that was originally a single table has been divided into a plurality of table parts because of page layout or the like, those table parts can be combined. The determination unit 21c can determine whether one table detected by the table detection unit 21a can be combined with another table based on the table data extracted by the table data extraction unit 21b. Specifically, the determination unit 21c can determine, for example, that two tables detected by the table detection unit 21a can be combined when the header information included in the table data of the two tables is identical. However, even if the header information of the two tables is identical, the two tables may have been created with the intention of being different tables. Therefore, the determination unit 21c may determine whether text data other than table data is included between the two tables whose header information has been determined to be identical, and may determine that the two tables can be combined when no text is included between them. If there is no text between the two tables, each of the two tables is presumed to be a divided table part of one table. Therefore, the determination unit 21c can determine that the two tables can be combined when there is no text between the two tables and the header information of the two tables is identical.

The two tables may be spread across two different pages. In this case, information that can be ignored as content of the document may be arranged between the two tables, such as a page number, text outside the page description area (e.g., a header, a footnote, or a document reference symbol or number), a table legend (e.g., a legend such as "1 million yen" in an accounting document), a table caption, or other such information. Even if information unrelated to the content of the document is placed between the two tables, the two tables were likely created with the intention of being one table if they appear on consecutive pages (for example, when one of the two tables is arranged on page 10 and the other is arranged on page 11). Therefore, the determination unit 21c may determine that the two tables can be combined when the two tables are arranged on consecutive pages and the header information of the two tables is identical. Alternatively, text that can be ignored may be registered in advance as predefined text, and the determination unit 21c may determine that the two tables can be combined when only predefined text is placed between the two tables and the header information of the two tables is identical.
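The determination criteria described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the determination unit 21c: the names `TablePart`, `can_combine`, `is_ignorable`, and `allow_consecutive_pages` are hypothetical, and the alternative embodiments (no text, only predefined text, consecutive pages) are simply combined with a logical OR.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class TablePart:
    """A detected table part (hypothetical structure for illustration)."""
    page: int      # page on which the part appears
    header: tuple  # row or column header texts


def can_combine(
    first: TablePart,
    second: TablePart,
    texts_between: Sequence[str],
    is_ignorable: Callable[[str], bool] = lambda text: False,
    allow_consecutive_pages: bool = False,
) -> bool:
    """Return True when the two parts look like pieces of one table."""
    # Identical header information is required in every variant described.
    if first.header != second.header:
        return False
    # No text between the parts: presumed to be one divided table.
    if not texts_between:
        return True
    # Only predefined (ignorable) text lies between the parts.
    if all(is_ignorable(text) for text in texts_between):
        return True
    # Optional embodiment: parts on consecutive pages with identical headers.
    return allow_consecutive_pages and second.page - first.page == 1
```

For example, two parts with identical headers on pages 10 and 11 are judged combinable when nothing lies between them, and are judged combinable despite intervening text only when `allow_consecutive_pages` is enabled or all intervening text is ignorable.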

FIGS. 2 and 3 show examples of tables that can be combined. FIG. 2 shows table T11 and table T12. Table T11 and table T12 are arranged, for example, on the same page of a digital document stored as the original document 25b. In the illustrated example, both table T11 and table T12 are table parts that form part of one statement of changes in shareholders' equity. The row header information T11a in table T11 contains the texts "balance at the beginning of the current period", "amount of changes during the current period", ..., "balance at the end of the current period", and the row header information T12a in table T12 contains the same texts. Therefore, the row header information T11a in table T11 and the row header information T12a in table T12 match. Further, no text is arranged between table T11 and table T12. Therefore, the determination unit 21c can determine that table T11 and table T12 can be combined. For example, when one table cannot be displayed within the width of one page because of the page width limit, the table may be divided into two tables and included in the digital document, as with table T11 and table T12. In this way, when one table has been divided in the horizontal direction into two tables in the digital document, the determination unit 21c can determine, based on the row header information of each of the two tables, that the two tables can be joined (joined horizontally).

FIG. 3 shows table T21 and table T22. Both table T21 and table T22 show a portion of the consolidated cash flow statement. In FIG. 3, it is assumed that table T21 is located on page 48 of the digital document and table T22 is located on page 49 of the digital document. The column header information T21a of table T21 consists of, in order from the left, the texts "previous consolidated fiscal year (from April 1, 2017 to March 31, 2018)" and "current consolidated fiscal year (from April 1, 2018 to March 31, 2019)", and the column header information T22a of table T22 consists of the same two texts in the same order. Therefore, the column header information T21a in table T21 and the column header information T22a in table T22 match. Further, between table T21 and table T22, the text "-48-" indicating the page number and the text "unit: 1,000 yen", which is a legend of the consolidated cash flow statement, are arranged. By registering frequently appearing items such as page numbers and table legends like "unit: 1,000 yen" as predefined text, the determination unit 21c can determine that table T21 and table T22 can be combined. That is, the determination unit 21c has a list of text patterns used to classify text as predefined text. For example, the text "-48-" has the text pattern of a number sandwiched between "-" characters and can be classified into the category "page number". Each category of predefined text may be weighted. For example, the importance of the categories "page number", "document reference symbol", and "table legend" may be "non-important". When determining whether two tables can be joined, the determination unit 21c determines whether all the text between the two tables belongs to "non-important" categories and, if so, can determine that the two tables can be combined. In another embodiment, even if no predefined text is defined, the determination unit 21c can determine that table T21 and table T22 can be combined because the two tables, which have common header information, are arranged on consecutive pages (pages 48 and 49). For example, when one table cannot be displayed on one page because of the page height limit, the table may be divided into two tables and included in the digital document, as with table T21 and table T22. In this way, when one table has been divided in the vertical direction into two tables in the digital document, the determination unit 21c can determine, based on the column header information of each of the two tables, that the two tables can be joined (joined vertically).
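The classification of intervening text by text pattern can be sketched as follows. The regular expressions, the `PATTERNS` and `IMPORTANCE` tables, and the function names are assumptions made for illustration; the description only gives "-48-" and "unit: 1,000 yen" as examples and does not specify the actual pattern list.

```python
import re

# Hypothetical pattern list: (category, compiled pattern).
PATTERNS = [
    # A number sandwiched between "-" characters, e.g. "-48-".
    ("page number", re.compile(r"^-\s*\d+\s*-$")),
    # A unit legend such as "unit: 1,000 yen".
    ("table legend", re.compile(r"^\(?unit:\s*[\d,]+\s*yen\)?$", re.IGNORECASE)),
]

# Hypothetical importance assigned to each category.
IMPORTANCE = {
    "page number": "non-important",
    "table legend": "non-important",
}


def classify(text):
    """Return the predefined-text category of `text`, or None if unmatched."""
    stripped = text.strip()
    for category, pattern in PATTERNS:
        if pattern.match(stripped):
            return category
    return None


def only_non_important(texts):
    """True when every text between two tables is in a non-important category."""
    return all(IMPORTANCE.get(classify(text)) == "non-important" for text in texts)
```

With these assumed patterns, `only_non_important(["-48-", "unit: 1,000 yen"])` holds, so the text between table T21 and table T22 would not prevent the join, whereas any unclassified body text would.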

The joining unit 21d joins two tables that the determination unit 21c has determined can be combined. For example, when the determination unit 21c determines that table T11 and table T12 shown in FIG. 2 can be combined, the joining unit 21d joins table T11 and table T12. Joining table T11 and table T12 generates a table different from the tables detected by the table detection unit 21a. In the present specification, a table obtained by joining two tables included in a digital document stored as the original document 25b may be referred to as a "combined table" or a "join table". The joining unit 21d can store the data contained in the two tables before the join in the storage 25 in association with table identification information that identifies the join table obtained by joining the two tables.

When two tables divided in the horizontal direction (for example, table T11 and table T12 in FIG. 2) are determined to be combinable, the joining unit 21d can generate a join table by joining the two tables horizontally. In a horizontally joined join table, the row data of the tables before the join is integrated. For example, when table T11 and table T12 shown in FIG. 2 are joined horizontally, both the data contained in the row of table T11 whose row header is "balance at the beginning of the current period" and the data contained in the row of table T12 whose row header is "balance at the beginning of the current period" become the data of the row whose row header is "balance at the beginning of the current period" in the horizontally joined join table.
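A horizontal join of this kind can be sketched as follows, assuming each table part is represented as a mapping from row header to the list of cell values in that row. The function name `join_horizontally` and the numeric values are hypothetical and are not taken from FIG. 2.

```python
def join_horizontally(left: dict, right: dict) -> dict:
    """Join two table parts that share the same row headers.

    Each part maps a row header to the list of cell values in that row.
    """
    if list(left) != list(right):
        raise ValueError("row headers differ, so the parts are not joined")
    # Cells of the right-hand part are appended to the matching row.
    return {header: left[header] + right[header] for header in left}


# Hypothetical values for illustration only.
t11 = {
    "Balance at the beginning of the current period": [100, 20],
    "Balance at the end of the current period": [110, 25],
}
t12 = {
    "Balance at the beginning of the current period": [5, 125],
    "Balance at the end of the current period": [7, 142],
}
joined = join_horizontally(t11, t12)
```

After the join, each row of `joined` holds the cells of the corresponding rows of both parts, which is what integrating the row data means here.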

When two vertically divided tables (for example, table T21 and table T22 in FIG. 3) are determined to be combinable, the joining unit 21d can generate a join table by joining the two tables vertically. In a vertically joined join table, the column data of the tables before the join is integrated. For example, when table T21 and table T22 shown in FIG. 3 are joined vertically, both the data included in the column of table T21 whose column header is "previous consolidated fiscal year (from April 1, 2017 to March 31, 2018)" and the data included in the column of table T22 with the same column header become the data of the column whose column header is "previous consolidated fiscal year (from April 1, 2017 to March 31, 2018)" in the vertically joined join table.
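A vertical join can be sketched in the same spirit, here assuming each table part is a pair of column headers and a list of rows. The names `join_vertically` and `column`, the shortened header texts, and the values are hypothetical.

```python
def join_vertically(upper, lower):
    """Join two table parts that share the same column headers.

    Each part is a (column_headers, rows) pair; rows are lists of cell values.
    """
    upper_headers, upper_rows = upper
    lower_headers, lower_rows = lower
    if upper_headers != lower_headers:
        raise ValueError("column headers differ, so the parts are not joined")
    # The header is kept once and the rows of the lower part follow.
    return upper_headers, upper_rows + lower_rows


def column(table, header):
    """Return all cell values in the column with the given header."""
    headers, rows = table
    index = headers.index(header)
    return [row[index] for row in rows]


# Hypothetical values for illustration only.
headers = ["Item", "Previous fiscal year", "Current fiscal year"]
t21 = (headers, [["Operating activities", 300, 320]])
t22 = (headers, [["Financing activities", -50, -40]])
combined = join_vertically(t21, t22)
```

Reading `column(combined, "Previous fiscal year")` then yields the values from both parts in page order.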

The joining unit 21d may join three or more tables. The criterion for joining three or more tables may be the same as the criterion for joining two tables. For example, if the row header information of three tables matches and there is no text, or only predefined text, between the three tables, the three tables can be joined into one join table. Three or more tables may be joined in stages. For example, when joining a first table, a second table, and a third table, the first table and the second table may first be joined to generate an intermediate join table, and this intermediate join table and the third table may then be joined to generate the final join table.
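Staged joining maps naturally onto a left fold. The sketch below assumes the same hypothetical (column_headers, rows) representation and uses `functools.reduce` so that each intermediate join table is joined with the next part; `join_pair` and `join_in_stages` are illustrative names.

```python
from functools import reduce


def join_pair(first, second):
    """Vertically join two (column_headers, rows) parts with equal headers."""
    headers_a, rows_a = first
    headers_b, rows_b = second
    if headers_a != headers_b:
        raise ValueError("header information differs, so the parts are not joined")
    return headers_a, rows_a + rows_b


def join_in_stages(parts):
    """Join parts one at a time: ((first + second) + third) + ..."""
    if not parts:
        raise ValueError("at least one table part is required")
    # reduce() carries the intermediate join table forward to the next part.
    return reduce(join_pair, parts)


# Hypothetical values for illustration only.
header = ["Item", "Amount"]
first = (header, [["A", 1]])
second = (header, [["B", 2]])
third = (header, [["C", 3]])
final_table = join_in_stages([first, second, third])
```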

The calculation unit 21e can perform calculations based on the numerical values included in the table data extracted by the table data extraction unit 21b and/or the numerical values included in the table data of a join table that was obtained by the joining unit 21d joining two tables and that is stored in the storage 25 in association with its table identification information. For example, if the digital document contains a balance sheet, the asset amounts listed in the assets section of the balance sheet are included in the table data of the balance sheet. The calculation unit 21e can calculate total assets by, for example, adding all the asset amounts included in the table data of the balance sheet. The calculation unit 21e can also execute calculations on various subsets of the data stored as table data. For example, total current assets can be calculated by summing the numerical values of the current asset items included in the table data of the balance sheet.

As described above, the calculation unit 21e can also perform calculations based on the table data of a join table. For example, if a balance sheet is divided across two pages in a digital document and the table parts that make up the balance sheet are not joined, the figures contained in the balance sheet cannot be calculated correctly from the table data. Suppose that the balance sheet is divided across two pages, that the table part on the preceding page contains all of the current assets and part of the fixed assets, and that the table part on the following page contains the remainder of the fixed assets. If this divided balance sheet is not joined, total assets, which is the sum of current assets and fixed assets, cannot be calculated. According to the embodiment of the present invention, the joining unit 21d joins two tables that were originally one table, such as the two table parts of a divided balance sheet. Therefore, even if the digital document contains a divided table, calculations can be performed correctly based on the table data of the join table. Even if the digital document contains a balance sheet divided across two pages, the embodiment of the present invention makes it possible to correctly calculate the asset amounts, the liability amounts, and other figures derived from the numerical values contained in the balance sheet.

The annotation addition unit 21f can add, to the digital document, annotation information indicating the result and/or content of a calculation performed by the calculation unit 21e based on the table data of a table. When a digital document with annotation information is displayed, an object corresponding to the annotation information is added in or near the table included in the digital document, in addition to the various objects contained in the original digital document. In the present specification, the object corresponding to the annotation information displayed together with the table of the digital document (or the object representing the annotation information) may be referred to simply as the annotation information. FIG. 4 shows a portion of the consolidated balance sheet T31 contained in the digital document. Annotation information is added to the consolidated balance sheet T31 of FIG. 4. The annotation information includes arrows A1 to A10 indicating calculation ranges, a mismatch mark S1 indicating that a calculation result of the calculation unit 21e does not match the corresponding data in the table data, and a match mark S2 indicating that a calculation result of the calculation unit 21e matches the corresponding data in the table data.

Each of the arrows A1 to A10 indicating a calculation range extends from a start point to an end point. It shows that the numerical values of the cells between the start point and the end point of the arrow are the calculation targets and that the calculation result is displayed in the cell at the end point. In the example of FIG. 4, arrow A1 extends, among the cells in the same column as the column header cell containing the text "previous consolidated fiscal year (March 31, 2018)", from the start cell of the current asset items (the "Cash and deposits" cell) to the end cell of the current asset items (the "Total current assets" cell). Arrow A1 shows that, in the column of "previous consolidated fiscal year (March 31, 2018)", the numerical values in the cells of the rows "Cash and deposits", "Notes and accounts receivable", "Inventories", "Accounts receivable", "Other", and "Allowance for doubtful accounts" between the start point and the end point of arrow A1 are added, and that the result of this addition is displayed in the cell of the "Total current assets" row corresponding to the end point of arrow A1. However, since the number in the "Allowance for doubtful accounts" row is marked with "△", it is converted to a negative value before being added (that is, it is subtracted). The result calculated by the calculation unit 21e for the calculation range indicated by arrow A1 is "38,545,156", and this matches the value displayed in the cell of the "Total current assets" row in the consolidated balance sheet T31. Therefore, a match mark S2 is attached to the cell of the "Total current assets" row in the column of "previous consolidated fiscal year (March 31, 2018)" to indicate that the calculation result of the calculation unit 21e matches the numerical value in the table included in the original document 25b being analyzed.

Arrows A2 to A10 also indicate calculation ranges in the same way as arrow A1. For example, arrow A2 extends, among the cells in the same column as the column header cell containing the text "current consolidated fiscal year (March 31, 2019)", from the start cell of the current asset items (the "Cash and deposits" cell) to the end cell of the current asset items (the "Total current assets" cell). As with arrow A1, it shows that the numerical values in the cells of the rows "Cash and deposits", "Notes and accounts receivable", "Inventories", "Accounts receivable", "Other", and "Allowance for doubtful accounts" between the start point and the end point of the arrow are added, and that the result of this addition is displayed in the cell of the "Total current assets" row corresponding to the end point of arrow A2. The result calculated by the calculation unit 21e for the calculation range indicated by arrow A2 is "46,398,832", which is smaller by 1 than the value "46,398,833" displayed in the cell of the "Total current assets" row of the consolidated balance sheet T31. Therefore, a mismatch mark S1 is attached to the cell of the "Total current assets" row in the column of "current consolidated fiscal year (March 31, 2019)" to indicate that the calculation result of the calculation unit 21e does not match the numerical value in the table included in the original document 25b being analyzed. Further, to show that the correct calculation result is smaller by 1 than the displayed value, "-1" is displayed in small type together with the mismatch mark S1 as difference information indicating the difference. In addition to or instead of the difference information, the calculation result of the calculation unit 21e ("46,398,832" in the above example) may be displayed in the case of a mismatch.
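The check behind the match mark S2, the mismatch mark S1, and the difference information can be sketched as follows. The function names and the sample cell values are hypothetical; the sign convention (calculated value minus displayed value) is chosen so that the arrow A2 example yields "-1".

```python
def parse_amount(text: str) -> int:
    """Convert a cell text such as "1,234" or "△1,234" to an integer.

    A leading "△" marks a negative amount.
    """
    text = text.strip()
    negative = text.startswith("△")
    digits = text.lstrip("△").replace(",", "").strip()
    value = int(digits)
    return -value if negative else value


def check_total(item_cells, total_cell):
    """Sum the item cells and compare the sum with the displayed total.

    Returns (calculated value, mark, difference), where difference is the
    calculated value minus the displayed value.
    """
    calculated = sum(parse_amount(cell) for cell in item_cells)
    displayed = parse_amount(total_cell)
    difference = calculated - displayed
    mark = "match" if difference == 0 else "mismatch"
    return calculated, mark, difference


# Hypothetical cell texts, not the values of FIG. 4.
result = check_total(["1,000", "2,500", "△300"], "3,201")
```

Here the three items sum to 3,200 while the table shows 3,201, so the sketch reports a mismatch with a difference of -1, mirroring the annotation described for arrow A2.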

In the balance sheet, the fixed assets section includes items showing subtotals, each of which is the sum of some of the items included in fixed assets. The arrows of the annotation information can also indicate the calculation range and the calculation result of such a subtotal. For example, in the consolidated balance sheet T31, "buildings and structures (net amount)" is a subtotal of the value of "buildings and structures" and the value of "accumulated depreciation" among the items included in fixed assets. Arrow A3 indicates the calculation range for calculating this "buildings and structures (net amount)". The result calculated by the calculation unit 21e for the calculation range indicated by arrow A3 is "2,002,570", which is larger by 1 than the value "2,002,569" displayed in the cell of the "buildings and structures (net amount)" row in the consolidated balance sheet T31. Therefore, a mismatch mark S1 is attached to the cell of the "buildings and structures (net amount)" row in the column of "previous consolidated fiscal year (March 31, 2018)" to indicate that the calculation result of the calculation unit 21e does not match the numerical value in the table included in the original document 25b being analyzed. Further, to show that the correct calculation result is larger by 1 than the displayed value, the difference information "1" is displayed in small type together with the mismatch mark S1.

Arrow A10 indicates the calculation range for calculating "total tangible fixed assets" among fixed assets. In the consolidated balance sheet T31 illustrated in FIG. 4, "total tangible fixed assets" is the sum of "buildings and structures (net amount)", "mechanical equipment and carriers (net amount)", "others (net amount)", "land", and "construction in progress". Since "buildings and structures (net amount)", "mechanical equipment and carriers (net amount)", and "others (net amount)" are subtotals of some of the fixed asset items, the range between the start point and the end point of arrow A10 includes both items to be calculated and items whose values are already reflected in a subtotal. Therefore, if all the numerical values in the range from the start point to the end point of arrow A10 were added, some fixed asset items would be counted twice. For example, "buildings and structures (net amount)" is a subtotal obtained by adding the value of "buildings and structures" and the value of "accumulated depreciation" among the items included in fixed assets. If "buildings and structures", "accumulated depreciation", and "buildings and structures (net amount)", all of which lie within the range of arrow A10, were all included in the calculation of "total tangible fixed assets", then "buildings and structures" and "accumulated depreciation" would be counted twice. Therefore, in order to calculate "total tangible fixed assets", "buildings and structures" and "accumulated depreciation" among the items within the range of arrow A10 are excluded from the calculation. Arrow A10 indicates the sections corresponding to items included in the calculation by a solid line A10a and the sections corresponding to items excluded from the calculation by a broken line A10b. As a result, the user of the table join system 1 can easily identify the items included in the calculation based on the annotation information.
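The exclusion of items that are already reflected in a subtotal can be sketched as follows. The description does not state how such items are identified, so the sketch assumes each row carries a hypothetical flag `in_subtotal`; the row labels follow FIG. 4 but the values are invented for illustration.

```python
def total_without_double_counting(rows):
    """Sum rows while skipping items already reflected in a subtotal.

    `rows` is a list of (label, value, in_subtotal) tuples. Returns the total
    and, per row, the line style for the corresponding section of the arrow.
    """
    total = 0
    styles = []
    for label, value, in_subtotal in rows:
        if in_subtotal:
            # Excluded from the calculation: drawn as a broken line (A10b).
            styles.append((label, "broken"))
        else:
            # Included in the calculation: drawn as a solid line (A10a).
            total += value
            styles.append((label, "solid"))
    return total, styles


# Hypothetical values for illustration only.
rows = [
    ("Buildings and structures", 5000, True),
    ("Accumulated depreciation", -3000, True),
    ("Buildings and structures (net amount)", 2000, False),
    ("Land", 1500, False),
    ("Construction in progress", 500, False),
]
total, styles = total_without_double_counting(rows)
```

With these values the total is 4,000 rather than 6,000, because the two component rows contribute only through their net-amount subtotal.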

As described above, the calculation unit 21e can perform calculations based on the table data of a join table in which two tables are joined. The annotation addition unit 21f can add, to the join table, annotation information indicating the result and/or content of a calculation performed based on the table data of the join table. FIGS. 5a and 5b show the combined table CT1 (statement of changes in shareholders' equity), obtained by joining tables T11 and T12 shown in FIG. 2, together with annotation information. Of the annotation information shown in FIGS. 5a and 5b, arrow A11 extends, among the cells in the same row as the row header cell containing the text "balance at the beginning of the current period", from the start cell (the "capital" cell) to the end cell (the "total shareholders' equity" cell). Arrow A11 shows that the items between the start point and the end point of arrow A11 in the "balance at the beginning of the current period" row are added and that the result is displayed in the cell of the "total shareholders' equity" column corresponding to the end point of arrow A11. However, since the "balance at the beginning of the current period" row also includes subtotals, some items are excluded from the calculation to prevent double counting, and arrow A11 is drawn as a dotted line over the cells of the excluded items. Over the cells included in the calculation, arrow A11 is drawn as a solid line. The result calculated by the calculation unit 21e for the calculation range indicated by arrow A11 is "14,535,608". Since this matches the value in the "total shareholders' equity" cell of the combined table CT1 shown in FIGS. 5a and 5b, the match mark S2 is attached to the right side of that cell.

Since the combined table CT1 represents the statement of changes in shareholders' equity, the "total assets" at the end of the statement is the sum of the "total shareholders' equity" included in table T11 and the "total of evaluation / conversion difference, etc." included in table T12. Therefore, the calculation based on the table data of the combined table CT1 is performed across the table data of table T11 and the table data of table T12. In FIGS. 5a and 5b, arrow A12 indicates the calculation range for calculating the "total assets". The calculation range corresponding to arrow A12 therefore needs to include both the numerical value of "total shareholders' equity" included in table T11 and the numerical value of "total of evaluation / conversion difference, etc." included in table T12. For this reason, arrow A12, which extends from the "capital" cell in table T11, does not end in table T11 but continues into table T12. As a display mode indicating that arrow A12 straddles table T11 and table T12 (that is, that arrow A12 does not terminate at the right end of table T11), arrow A12 can be drawn as if it were cut off midway at the right end of table T11, as shown in FIG. 5a, and drawn as extending from the corresponding portion of table T12. In this case, since arrow A11 and arrow A12 are displayed differently at the right end of table T11, the user can easily identify which arrow terminates there and which does not. Alternatively, at the right end of table T11, arrow A12 may be marked with a predetermined continuation symbol (a circle, a square, or another symbol distinguishable from the arrowhead at the end of an arrow), and the same or a corresponding symbol may be placed at the corresponding position in table T12, from which the display of arrow A12 resumes. In this case as well, the continuation symbol differs from the arrowhead indicating the end of an arrow, so the user can easily identify whether or not the arrow continues into the subsequent table. As described above, even when one table is divided into a plurality of table parts in the original document 25b, the annotation information can correctly indicate a calculation range that spans the plurality of table parts.

The annotation information added to a join table has been explained above using the statement of changes in shareholders' equity as an example, but for join tables other than the statement of changes in shareholders' equity as well, the annotation information can correctly indicate a calculation range that spans a plurality of table parts. For example, FIG. 3 shows a cash flow statement divided across two pages. If tables T21 and T22, which are parts of this cash flow statement, are joined to generate a join table, the "increase in cash and cash equivalents" at the end of the cash flow statement is the sum of "cash flows from operating activities" and "cash flows from investing activities" included in table T21 and "cash flows from financing activities" included in table T22. Therefore, the calculation based on the table data of this join table is performed across the table data arranged in table T21 and the table data arranged in table T22, and the annotation information can likewise correctly indicate the calculation range of the calculation performed across table T21 and table T22.

The digital document to which the annotation information is added may be stored in the storage 25 as the annotated document 25c.

The annotation information expressly described herein is merely an example of the annotation information applicable to the invention disclosed in the present application, and the applicable annotation information is not limited to that specifically described herein.

Although not shown in FIG. 1, the processor 21 of the server 20 can execute a function of receiving a transmission request for a digital document from the user device 10. The transmission request may include identification information that identifies the digital document. Upon receiving a transmission request for a specific digital document, the processor 21 can read the digital document from the storage 25 and transmit it to the user device 10. As described above, the storage 25 may store the original document 25b, which is an original digital document, and the annotated document 25c, in which annotation information is added to the original digital document. The transmission request from the user device 10 may include document type identification information that identifies whether the original digital document or the digital document with annotation information is being requested.

Next, the recognition of cells in a table containing blank-separated cells, that is, cells whose top, bottom, left, and right are separated by blanks, will be described. FIG. 6 shows, as an example of a table containing blank-separated cells, a table T41 containing twelve cells delimited by blanks in a region delimited by horizontal ruled lines. When the table data extraction unit 21b extracts table data from this type of table by the image-based method, it converts, for example, the original document 25b including the table T41 (or the page of the original document that includes the table T41) into an image file and recognizes the table layout using this image file. For example, the table data extraction unit 21b detects ruled lines in the image file converted from the digital document and recognizes each rectangular area between ruled lines as a temporary cell. In the example of FIG. 6, the ruled lines L1 to L4 are detected, and the areas between the ruled lines are recognized as the temporary cells T41a to T41c.

Next, the table data extraction unit 21b determines whether each temporary cell between ruled lines has rows and columns separated by blanks instead of ruled lines. For example, when a temporary cell contains a plurality of texts that are separated from each other in the horizontal direction, the table data extraction unit 21b can recognize that the temporary cell has columns separated by blanks instead of ruled lines. Focusing on the horizontal arrangement of the text in the temporary cell T41b, the text "buildings and structures", the text "8,994,000 yen", and the text "44,049 thousand yen" are arranged inside the temporary cell T41b so as to be separated from each other in the horizontal direction. It is therefore recognized that there are three columns separated by blanks inside the temporary cell T41b. Similarly, it is recognized that the temporary cell T41a and the temporary cell T41c also have three columns separated by blanks.

Focusing on the vertical arrangement of the text in the temporary cell T41b, the leftmost column inside the temporary cell T41b contains the text "buildings and structures", the text "mechanical devices and carriers", the text "Other tangible fixed assets", and the text "Software", which are arranged so as to be separated from each other in the vertical direction. It is therefore recognized that there are four rows separated by blanks inside the temporary cell T41b.

As described above, the table data extraction unit 21b recognizes that blank-separated cells in 4 rows × 3 columns exist inside the temporary cell T41b. Similarly, it recognizes that blank-separated cells in 1 row × 3 columns exist inside each of the temporary cell T41a and the temporary cell T41c. The table data extraction unit 21b detects the table data arranged in the blank-separated cells recognized in this way by, for example, OCR.
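One way to recognize blank-separated rows and columns inside a temporary cell is to look for gaps between the bounding boxes of the detected text fragments. The sketch below is an assumption-laden illustration, not the method of the table data extraction unit 21b: the function names, the bounding-box representation, and the `min_gap` threshold are hypothetical, and rows are counted from all fragments rather than from the leftmost column only.

```python
def count_bands(intervals, min_gap=5.0):
    """Count groups of 1-D intervals separated by blanks wider than min_gap."""
    bands = 0
    current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start - current_end > min_gap:
            # A blank wider than min_gap starts a new row or column.
            bands += 1
            current_end = end
        else:
            current_end = max(current_end, end)
    return bands


def grid_shape(boxes, min_gap=5.0):
    """Return (rows, columns) for text boxes inside one temporary cell.

    Each box is (x0, y0, x1, y1) for one text fragment.
    """
    columns = count_bands([(x0, x1) for x0, _, x1, _ in boxes], min_gap)
    rows = count_bands([(y0, y1) for _, y0, _, y1 in boxes], min_gap)
    return rows, columns


# Hypothetical layout: 4 rows x 3 columns of text fragments.
xs = [(0, 80), (120, 170), (200, 250)]
ys = [(0, 10), (20, 30), (40, 50), (60, 70)]
boxes = [(x0, y0, x1, y1) for (y0, y1) in ys for (x0, x1) in xs]
shape = grid_shape(boxes)
```

For this hypothetical layout the sketch reports four rows and three columns, the same grid as described for the temporary cell T41b.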

The calculation unit 21e can execute the calculation based on the table data of the cells separated by the blanks detected as described above. The calculation unit 21e can also execute the calculation using both the table data of the cells separated by the blank and the table data of the cells separated by the ruled line. The annotation addition unit 21f can add and display the annotation information indicating the result and / or the content of the calculation performed by the calculation unit 21e to the table T41 including the cells separated by blanks. In Table T41 of FIG. 6, an arrow indicating a calculation range and a mismatch mark are added.

In each of the tables shown in FIGS. 7 to 9, as in FIG. 6, cells separated by blanks exist in areas delimited by ruled lines. The table data extraction unit 21b can recognize the blank-separated cells in each of the tables shown in FIGS. 7 to 9 by the same method used to detect the blank-separated cells in FIG. 6. Annotation information indicating the result and/or content of calculations on the table data arranged in the blank-separated cells is added to each of the tables of FIGS. 7 and 8. In the table of FIG. 9, calculations are performed in both the vertical and horizontal directions, and annotation information indicating the result and/or content of those calculations is added to the table.

Subsequently, with reference to FIG. 10, a table joining process for joining two or more tables included in a digital document will be described. First, in step S11, the table included in the digital document to be processed is detected. The process of this step S11 is performed by, for example, the above-mentioned table detection unit 21a.

Next, in step S12, table data is extracted by the text-based method for each of the tables detected in step S11, and in step S13, table data is extracted by the image-based method for each of the tables detected in step S11. Steps S12 and S13 may be performed in parallel, or step S13 may be performed before step S12. In step S13, cells that are arranged in an area delimited by ruled lines but are not themselves delimited by ruled lines (blank-separated cells) may be detected, and the table data arranged in those blank-separated cells may be extracted. One of steps S12 and S13 may be omitted.

Next, in step S14, it is determined for each cell whether there is duplication between the table data extracted in step S12 and the table data extracted in step S13 (consistency check). If there is duplication, the table data extracted by the image-based method in step S13 is adopted, and the adopted table data is stored in the storage 25 as the table data for the table. Therefore, when the calculation process or the annotation addition process is performed using the table data, the table data extracted by the image-based method is used. In other words, when table data has been extracted by the image-based method, it takes precedence over the table data extracted by the text-based method. This is because, in the image-based method, the outer borders of the table (the vertical and horizontal outer borders) are clearly defined, so the table is extracted more accurately. Having such lines around the table allows the table to be separated more accurately from the text around it. In step S14, for example, when the position in the digital document of a table detected in step S12 matches the position in the digital document of a table detected in step S13, it is determined that the same table has been extracted twice. For example, if steps S12 and S13 both find a table at the same or a nearby position on page 10 of the digital document, and the two tables are the same or similar in size, the two tables are determined to be duplicates.

In step S14, even if the table detected in step S12 and the table detected in step S13 are at the same position, the table data with the larger number of detected cells may be adopted when the numbers of detected cells differ. For example, table T41 shown in FIG. 6 contains a large number of cells separated by blanks. For a table containing many cells separated by blanks, the image-based method, which detects cells based on ruled lines, may fail to recognize some cells correctly even when it also detects cells separated by blanks. Therefore, if more cells are detected in step S12, in which table data is extracted by the text-based method, than in step S13, in which table data is extracted by the image-based method, the table data extracted by the text-based method may be adopted.
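The consistency check of step S14 can be illustrated with a minimal Python sketch. The `ExtractedTable` structure, the bounding-box tolerance `tol`, and the function names are assumptions introduced only for illustration; the specification does not prescribe how positions and sizes are compared.

```python
from dataclasses import dataclass

@dataclass
class ExtractedTable:          # hypothetical container, not defined in the specification
    page: int
    bbox: tuple                # (x0, y0, x1, y1) in page coordinates
    cells: list                # flat list of cell texts
    method: str                # "text" (step S12) or "image" (step S13)

def is_same_table(a, b, tol=5.0):
    """Treat two extractions as duplicates when they are on the same page
    and every bounding-box edge differs by at most `tol` (assumed tolerance)."""
    if a.page != b.page:
        return False
    return all(abs(p - q) <= tol for p, q in zip(a.bbox, b.bbox))

def resolve_duplicate(text_tbl, image_tbl):
    """Image-based data wins by default; text-based data wins only
    when it contains more detected cells."""
    if len(text_tbl.cells) > len(image_tbl.cells):
        return text_tbl
    return image_tbl

def merge_extractions(text_tables, image_tables):
    """Return one table per detected position, resolving duplicates."""
    kept = []
    unmatched_text = list(text_tables)
    for img in image_tables:
        match = next((t for t in unmatched_text if is_same_table(t, img)), None)
        if match is None:
            kept.append(img)
        else:
            unmatched_text.remove(match)
            kept.append(resolve_duplicate(match, img))
    kept.extend(unmatched_text)   # tables found only by the text-based method
    return kept
```

In this sketch a table found by only one of the two methods is kept as is, which corresponds to the case where one of steps S12 and S13 yields no counterpart.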

Next, in step S15, the table data of each table extracted in step S12 or S13 is stored in the storage 25 in association with the table identification information of that table. If it is determined in step S14 that the table data extracted in step S12 and the table data extracted in step S13 overlap, only the table data extracted in step S13 may be stored in the storage 25, and the table data extracted in step S12 may be discarded.

Next, in step S16, it is determined whether each of the tables included in the digital document to be processed can be joined with another table. For example, in step S16, another table having header information that matches the header information of one table included in the digital document is selected as a join candidate. More specifically, for one of the tables detected in step S11, another table having row header information or column header information that matches at least one of the row header information and the column header information of that table is selected as a join candidate. In step S16, it is further determined whether to perform the join with the selected join candidate. For example, if there is no text, or only predefined text, between a table contained in the digital document and the table selected as its join candidate, it is determined that the table is to be joined with the join candidate. The determination process in step S16 may be performed by the determination unit 21c described above.

Alternatively, in step S16, if there is no text, or only predefined text, between any two of the tables detected in step S11, the two tables may be selected as a join candidate pair. In this case, it is determined whether at least one of the row header information and the column header information matches between the two tables selected as the join candidate pair, and if it matches, it is determined that the join candidate pair can be joined.
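The two conditions evaluated in step S16 (matching header information, and no text or only predefined text between the tables) can be sketched as follows. The `Table` structure, the contents of `ALLOWED_BETWEEN`, and the function names are hypothetical; the specification does not state what the predefined text is, so the value used here is only a placeholder.

```python
from dataclasses import dataclass

@dataclass
class Table:                        # hypothetical structure for illustration
    table_id: int
    column_headers: list
    row_headers: list

# Placeholder for the "predefined text" allowed between two table portions (assumption).
ALLOWED_BETWEEN = {"(continued)"}

def headers_match(a, b):
    """True when the column headers or the row headers of the two tables coincide."""
    same_cols = bool(a.column_headers) and a.column_headers == b.column_headers
    same_rows = bool(a.row_headers) and a.row_headers == b.row_headers
    return same_cols or same_rows

def text_between_allows_join(text_between):
    """True when nothing, or only predefined text, lies between the tables."""
    fragments = [t.strip().lower() for t in text_between if t.strip()]
    return all(t in ALLOWED_BETWEEN for t in fragments)

def can_join(a, b, text_between):
    """Both conditions of step S16 must hold for the pair to be joinable."""
    return headers_match(a, b) and text_between_allows_join(text_between)
```

Because both checks are independent predicates, the order in which they are applied (candidate selection by headers first, or by intervening text first) does not change the result, which mirrors the two variants described above.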

If no joinable tables are identified in step S16, the table join process ends. When a set of joinable tables is identified in step S16, the two (or three or more) tables (table portions) determined to be joinable are joined. When two tables are joined, the joined table is assigned a table identification number that identifies it, and the table data of the two tables before the join is stored in association with that table identification number. This joining process may be performed by the above-mentioned joining unit 21d. In step S17, when all the joinable tables have been joined, the table join process ends.
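As a rough illustration of the joining step, the sketch below appends the rows of a second table portion to those of a first and stores the result under a newly issued identifier, together with the two source tables. It assumes, purely for illustration, that the portions share the same column headers (for example, a table continued on the next page); the dictionary layout, the `J`-prefixed identifiers, and the `store` argument standing in for the storage 25 are all hypothetical.

```python
from itertools import count

_join_ids = count(1)   # hypothetical generator of identification numbers for joined tables

def join_tables(first, second, store):
    """Join two table portions and record the result in `store`.

    `first` and `second` are dicts with the keys 'id', 'column_headers' and 'rows'.
    The joined table keeps references to the table data of both source tables.
    """
    join_id = f"J{next(_join_ids)}"
    store[join_id] = {
        "column_headers": first["column_headers"],
        "rows": first["rows"] + second["rows"],   # second portion continues the first
        "sources": [first, second],               # table data before the join
    }
    return join_id
```

Joining three or more portions can be expressed by calling the function repeatedly, using the previously joined table as `first`.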

Next, with reference to FIG. 11, the flow of the annotation addition process, which adds to a table annotation information indicating the result and/or the content of a calculation on the numerical values included in the table, will be described. In the annotation addition process shown in FIG. 11, it is assumed that the table data of the tables included in the digital document has been extracted and is available as a calculation target at the start of the process or at an appropriate time after the start. First, in step S21, numerical data indicating numerical values is extracted from the table data of a table included in the digital document to be processed. Next, in step S22, a predetermined calculation is performed based on the numerical values extracted in step S21. For example, if the digital document to be processed contains the balance sheet shown as table T31, the numerical values extracted in step S21 are calculated according to the rules that a balance sheet is expected to follow. For example, as shown in FIG. 4, the numerical values arranged in the cells corresponding to the individual items of "current assets" (such as "cash and deposits"), which are contained in the rows between the row header "Current assets" and the row header "Total current assets", are added, and the value corresponding to "Total current assets" is calculated. The calculation rules can be set in advance. For example, a rule can be set such that, if a row header contains the term "total", the numerical values contained in those rows that precede that row header and are presented in the table as a single group are added. When this rule is applied to table T31, the six rows preceding the row header "Total current assets" (the rows from "Cash and deposits" to "Provision for credit losses") are presented as a single group in table T31, so a rule can be set to add the numerical values contained in these rows for each column.
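The "total" rule described above can be sketched in Python as follows. The input format (a list of row header and value pairs), the convention that a row without values opens a new group, and the function name are assumptions made for illustration; the figures in the usage example are invented and are not taken from FIG. 4.

```python
def sum_groups(rows, keyword="total"):
    """Apply the 'total' rule column by column.

    `rows` is a list of (row_header, values) pairs, where `values` is a list of
    numbers (one per column) or an empty list for a section header such as
    "Current assets". Returns {total_row_header: [column sums of the group]}.
    """
    results = {}
    group = []
    for header, values in rows:
        if not values:                       # section header: start a new group
            group = []
        elif keyword in header.lower():      # total row: sum the preceding group
            results[header] = [sum(col) for col in zip(*group)]
            group = []
        else:                                # ordinary item row
            group.append(values)
    return results

# Invented example figures (previous year, current year).
balance_sheet = [
    ("Current assets", []),
    ("Cash and deposits", [100, 120]),
    ("Accounts receivable", [50, 60]),
    ("Provision for credit losses", [-5, -6]),
    ("Total current assets", [145, 175]),
]
computed = sum_groups(balance_sheet)
```

The computed sums can then be compared with the values printed in the total row, which is the comparison performed in the subsequent step.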

Next, in step S23, annotation information indicating the result and/or the content of the calculation in step S22 is generated, and the generated annotation information is displayed together with the table. For example, when the balance sheet shown in FIG. 4 is annotated, it is determined whether the "total current assets" value calculated in step S22 matches the numerical value in each cell of the "Total current assets" row of the table, and the mismatch mark S1 or the match mark S2 is displayed according to the result of the determination. In the example shown in FIG. 4, the cell in the "Total current assets" row of the column "Previous consolidated fiscal year (March 31, 2018)" shows a numerical value that matches the calculation result of step S22, so the match mark S2 is displayed in or near that cell. Conversely, the cell in the "Total current assets" row of the column "Current consolidated fiscal year (March 31, 2019)" shows a numerical value that does not match the calculation result of step S22, so the mismatch mark S1 is displayed in or near that cell. Further, the calculation ranges are indicated by arrows A1 to A10. Once the annotation information has been added to the table in this way, the annotation addition process ends.
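Generating the match and mismatch marks of step S23 might look like the following sketch. The `Annotation` structure and the `annotate_total` function are hypothetical, exact integer equality is assumed as the matching criterion, and the numbers in the usage example are invented rather than taken from FIG. 4.

```python
from dataclasses import dataclass

@dataclass
class Annotation:             # hypothetical record for one annotated total cell
    column: str
    mark: str                 # "S2" = match mark, "S1" = mismatch mark
    computed: int
    stated: int
    range_rows: tuple         # (first row header, last row header) of the calculation range

def annotate_total(columns, computed, stated, range_rows):
    """Compare computed and stated totals column by column and emit one annotation each."""
    return [
        Annotation(col, "S2" if c == s else "S1", c, s, range_rows)
        for col, c, s in zip(columns, computed, stated)
    ]

annotations = annotate_total(
    columns=["Previous consolidated fiscal year", "Current consolidated fiscal year"],
    computed=[145, 174],      # invented results of the step S22 calculation
    stated=[145, 175],        # invented values printed in the total row
    range_rows=("Cash and deposits", "Provision for credit losses"),
)
```

The `range_rows` field plays the role of the arrows indicating the calculation range; how the marks and arrows are rendered on the document is outside the scope of this sketch.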

The annotation addition process may be performed on each of the tables included in the digital document, or on a joined table generated by joining two or more tables included in the digital document. That is, the annotation addition process disclosed herein does not presuppose that the tables contained in the digital document are joined. For this reason, although the table join system 1 is called a "table join" system for convenience of explanation, it does not necessarily perform table joins. For example, when the table join system 1 shown in FIG. 1 performs the annotation addition process without joining the tables included in the digital document, it need not have the function of at least one of the determination unit 21c and the joining unit 21d. In this case, the table join system 1 does not have a function of joining the tables included in the digital document, but can still perform the annotation addition process on a table.

Subsequently, another aspect of the invention disclosed in the present specification will be described with reference to FIGS. 12 and 13. In addition to, or in place of, the function of joining the tables contained in the digital document, the processor 21 of the table join system 1 may function as a table division unit (not shown) that divides a table contained in the digital document. For example, as shown in FIG. 12, the table division unit may determine whether a repeat pattern exists in the column header information of an original table included in the digital document as one table and, if it determines that a repeat pattern exists, divide the original table into sub-tables, one for each repetition. For example, in the table T51 shown in FIG. 12, C1 to C5 appear three times as the column header information. In this case, the processor 21 may use, as a cutting position, the ruled line L11 between the column header cell containing C5, the end point of one repetition, and the column header cell containing C1, the start point of the next repetition, and divide the table horizontally into three tables (sub-tables) T51-1 to T51-3. Each of the sub-tables T51-1 to T51-3 thus obtained contains exactly one unit of the repeat pattern, consisting of the five columns corresponding to the column headers C1 to C5. FIG. 12 shows an example in which the repeat pattern appears in the column header information, but the repeat pattern may appear in the row header information. In that case, the original table is divided vertically for each repetition.
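A minimal sketch of dividing a table at a repeating column-header pattern is given below. It assumes that the header list covers only the repeating columns (no leading row-header column) and that the whole header row is an exact repetition of one unit; the function names are hypothetical, and the specification does not limit detection to this approach.

```python
def find_repeat_unit(headers):
    """Return the length of the shortest unit that, repeated at least twice,
    reproduces `headers` exactly; return None if there is no such unit."""
    n = len(headers)
    for size in range(1, n // 2 + 1):
        if n % size == 0 and headers == headers[:size] * (n // size):
            return size
    return None

def split_by_repeat(headers, rows):
    """Divide a table into sub-tables, one per repetition of the header unit.

    Returns a list of (sub_headers, sub_rows) pairs. A table without a
    repeat pattern is returned unchanged as a single pair.
    """
    size = find_repeat_unit(headers)
    if size is None:
        return [(headers, rows)]
    return [
        (headers[start:start + size], [row[start:start + size] for row in rows])
        for start in range(0, len(headers), size)
    ]

# A header row shaped like table T51: C1 to C5 repeated three times.
t51_headers = ["C1", "C2", "C3", "C4", "C5"] * 3
t51_rows = [list(range(15))]
sub_tables = split_by_repeat(t51_headers, t51_rows)
```

For a repeat pattern in the row headers, the same functions can be applied to the transposed table.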

Table identification information that identifies each sub-table may be added to each of the divided sub-tables. The table data extraction unit 21b may extract table data for each of the divided sub-tables and store the extracted table data in the storage 25 for each sub-table. The table data of each sub-table may be stored in the storage 25 in association with the table identification information that identifies that sub-table. The calculation unit 21e can execute a calculation for each divided sub-table based on the numerical values included in the table data of that sub-table. The annotation addition unit 21f can add and display, for each divided sub-table, annotation information indicating the result and/or the content of the calculation performed by the calculation unit 21e.

On the other hand, in table T61 shown in FIG. 13, C1 to C5 are repeated twice, but the first C3 counting from the left is arranged at the head (left end) of the plurality of sub-columns included in the column of GC2, while the second C3 counting from the left is arranged at the right end of the plurality of sub-columns included in the column whose header information is GC3. Therefore, it is determined that table T61 does not have a repeat pattern at which it can be divided.

The repeat pattern that serves as the unit for dividing a table can be detected by analyzing the text contained in the column header cells or the row header cells. For example, the balance sheet shown in FIG. 14 includes a column header cell containing the text "Subject" and a column header cell containing the text "Amount" in each of the "Assets" section and the "Liabilities" section. Therefore, the table division unit can determine that the column header cell containing "Subject" and the column header cell containing "Amount" are repeated in the balance sheet. Based on this determination, the table division unit can divide the balance sheet into a sub-table consisting of the columns of the "Assets" section and a sub-table consisting of the columns of the "Liabilities" section.

Further, the processor 21 may have a function of estimating whether each of the cells constituting a table included in the digital document is a header cell containing header information or a table data cell containing table data. When the table includes, in this order, a first header column containing only header cells, a first data column containing only table data cells, a second header column containing only header cells, and a second data column containing only table data cells, the processor 21 may divide the table into two sub-tables at the boundary between the first data column and the second header column. See FIG. 14 again. In the balance sheet of FIG. 14, the "Assets" section is placed on the left side and the "Liabilities" section is placed on the right side. The "Assets" section is divided into a "Subject" column containing only header information and an "Amount" column containing only the content (data) for that header information. Similarly, the "Liabilities" section is divided into a "Subject" column containing only header information and an "Amount" column containing only the content (data) for that header information. Therefore, the balance sheet can be divided into two sub-tables at the boundary between the "Amount" column of the "Assets" section and the "Subject" column of the "Liabilities" section. In this way, one table included in the digital document can also be divided into two or more sub-tables by a method other than recognizing a repeat pattern as described above.
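The header/data column arrangement described above can be sketched as follows. The specification does not state how a cell is estimated to be a header cell or a table data cell, so the numeric-pattern heuristic used here, along with all function names, is an assumption made only for illustration.

```python
import re

# Assumed heuristic: a table data cell holds a number such as "1,234", "-56" or "(78)".
_NUMERIC = re.compile(r"^[(\-]?\s*\d[\d,]*(\.\d+)?\)?$")

def is_data_cell(text):
    """Estimate whether a cell is a table data cell (numeric) rather than a header cell."""
    return bool(_NUMERIC.match(text.strip()))

def classify_column(cells):
    """Return 'data' if every non-empty cell is numeric, 'header' if none is, else 'mixed'."""
    flags = [is_data_cell(c) for c in cells if c.strip()]
    if flags and all(flags):
        return "data"
    if not any(flags):
        return "header"
    return "mixed"

def split_header_data(columns):
    """Split a four-column table arranged as header, data, header, data.

    `columns` is a list of columns, each a list of body cell texts (the column
    header row itself is excluded). Returns (left_sub_table, right_sub_table),
    or None when the arrangement does not match.
    """
    kinds = [classify_column(col) for col in columns]
    if kinds == ["header", "data", "header", "data"]:
        return columns[:2], columns[2:]   # cut between first data and second header column
    return None
```

A production classifier would likely need more signals than a numeric pattern (for example, currency symbols or cell position), which is why the heuristic is isolated in `is_data_cell`.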

Even if the processes and procedures described herein are described as being performed by a single device, piece of software, component, or module, such processes or procedures may be performed by multiple devices, multiple pieces of software, multiple components, and/or multiple modules. Likewise, even if the data, tables, or databases described herein are described as being stored in a single storage device (storage or memory), such data, tables, or databases may be distributed and stored in a plurality of storage devices provided in a single device, or in a plurality of storage devices distributed across a plurality of devices. Further, the software and hardware elements described herein can also be realized by integrating them into fewer components or by breaking them down into more components.

Notations such as "first", "second", and "third" in the present specification are attached to identify components, and do not necessarily limit their number, order, or content.

The components represented in the singular form in this specification shall include the plural form as long as they do not cause a contradiction.

Cross-reference: This application claims priority to Japanese Patent Application 2020-196428 and Japanese Patent Application 2020-196431, both filed on 26 November 2020, and incorporates the entire contents of these applications by reference.

1 Table join system
10 User equipment
20 Server

Claims

A table join program that causes one or more processors to execute:
a table detection function of detecting, from a digital document, a first table portion and a second table portion different from the first table portion;
a determination function of determining whether the first table portion and the second table portion can be combined; and
a function of joining the first table portion and the second table portion when it is determined that the first table portion and the second table portion can be combined.
The table join program according to claim 1, wherein the first table portion is extracted from a first page, and the second table portion is extracted from a second page different from the first page.
The table join program according to claim 1 or 2, further causing the one or more processors to execute a function of executing a calculation based on first numerical data included in the first table portion and second numerical data included in the second table portion.
The table join program according to claim 3, further causing the one or more processors to execute a function of adding, to the digital document, annotation information indicating at least one of the result of the calculation based on the first numerical data and the second numerical data and the content of the calculation.
The table join program according to any one of claims 1 to 4, wherein it is determined that the first table portion and the second table portion can be combined when first header information of the first table portion and second header information of the second table portion are the same.
The table join program according to any one of claims 1 to 5, wherein it is determined that the first table portion and the second table portion can be combined only when no text exists between the first table portion and the second table portion in the digital document, or when only predetermined text exists between the first table portion and the second table portion.
The table join program according to any one of claims 1 to 6, wherein the digital document is an unstructured document.
An annotation addition program that causes one or more processors to execute:
a function of extracting numerical values from an unstructured document;
a function of executing a calculation based on the extracted numerical values; and
a function of adding, to the unstructured document, annotation information indicating at least one of the result of the calculation and the content of the calculation.
A table division program that causes one or more processors to execute:
a function of analyzing the pattern of the cells that make up a table contained in a digital document; and
a function of dividing the table into sub-tables, one for each repetition, when a repeat pattern appears in any column or row of the table.
A table division program that causes one or more processors to execute:
a function of estimating whether each of the cells constituting a table included in a digital document corresponds to a header element or a table data element; and
a function of dividing the table into two sub-tables at the boundary between a first data column and a second header column when the table includes, in this order, a first header column containing only cells corresponding to the header element, the first data column containing only cells corresponding to the table data element, the second header column containing only cells corresponding to the header element, and a second data column containing only cells corresponding to the table data element.
A table join system comprising one or more processors,
wherein the one or more processors, by executing computer-readable instructions:
detect, from a digital document, a first table portion and a second table portion different from the first table portion;
determine whether the first table portion and the second table portion can be combined; and
join the first table portion and the second table portion when it is determined that the first table portion and the second table portion can be combined.
A table join method performed by one or more computer processors executing computer-readable instructions, the method comprising:
a step of detecting, from a digital document, a first table portion and a second table portion different from the first table portion;
a step of determining whether the first table portion and the second table portion can be combined; and
a step of joining the first table portion and the second table portion when it is determined that the first table portion and the second table portion can be combined.
A user device comprising one or more processors,
wherein the one or more processors, by executing computer-readable instructions:
upload a digital document to a server; and
acquire from the server, and display, an annotated document to which annotation information has been added, the annotation information indicating at least one of the result of a calculation based on numerical values extracted from the digital document and the content of the calculation.
An annotation method performed by one or more computer processors executing computer-readable instructions, the method comprising:
a step of extracting numerical values from an unstructured document;
a step of executing a calculation based on the extracted numerical values; and
a step of adding, to the unstructured document, annotation information indicating at least one of the result of the calculation and the content of the calculation.
A table division method performed by one or more computer processors executing computer-readable instructions, the method comprising:
a step of analyzing the pattern of the cells that make up a table contained in a digital document; and
a step of dividing the table into sub-tables, one for each repetition, when a repeat pattern appears in any column or row of the table.
A table division method performed by one or more computer processors executing computer-readable instructions, the method comprising:
a step of estimating whether each of the cells constituting a table included in a digital document corresponds to a header element or a table data element; and
a step of dividing the table into two sub-tables at the boundary between a first data column and a second header column when the table includes, in this order, a first header column containing only cells corresponding to the header element, the first data column containing only cells corresponding to the table data element, the second header column containing only cells corresponding to the header element, and a second data column containing only cells corresponding to the table data element.
A method of creating an electronically annotated unstructured document, the method comprising:
a step of acquiring an unstructured document;
a step of detecting, from the unstructured document, a first table portion and a second table portion different from the first table portion;
a step of determining whether the first table portion and the second table portion can be combined;
a step of joining the first table portion and the second table portion when it is determined that the first table portion and the second table portion can be combined;
a step of executing a calculation based on first numerical data included in the first table portion and second numerical data included in the second table portion; and
a step of generating the annotated unstructured document by adding, to the unstructured document, annotation information indicating at least one of the result of the calculation based on the first numerical data and the second numerical data and the content of the calculation.