WO2022113378A1 - Table combining program, table combining system, and table combining method - Google Patents
- Publication number
- WO2022113378A1 (PCT/JP2020/048664)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- table portion
- data
- calculation
- digital document
- document
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/151—Transformation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
- G06F40/177—Editing, e.g. inserting or deleting of tables; using ruled lines
Definitions
- the disclosure herein relates to a table join program, a table join system, and a table join method for processing table data extracted from a table contained in a digital document.
- Digital documents often include tables.
- IR materials such as financial statements include many tables.
- table data includes, for example, text data expressed in a table format.
- Conventional techniques for extracting table data from a table included in a digital document are described in, for example, Patent Document 1 and Non-Patent Document 1 below.
- As described in Patent Document 1, when a digital document is created in PDF (Portable Document Format), two methods for extracting table data from a table are known: an image-based method, in which the PDF file is converted into an image file and the table data is extracted based on that image file, and a direct method, in which text data and formatting data such as vertical and horizontal ruled lines are read directly from the PDF file.
- In the image-based method, the table layout is recognized and the table data is extracted using the image file converted from the PDF file.
- The table layout is recognized by analyzing image information such as the ruled lines of the table, and the table data (for example, the text contained in the table) is extracted by, for example, OCR (Optical Character Recognition / Reader).
- In the direct method, the text is read directly from the PDF file, and the layout representing the positional relationship between the texts is recognized.
- Digital documents include various types of tables.
- some tables have at least one of the rows or columns separated by whitespace.
- In a table in which rows or columns are separated by such blanks, it is difficult to recognize the table layout by an image-based method.
- In addition, one table may be divided into multiple table parts and displayed.
- A table that is divided into two table portions and displayed in this way is recognized as two separate tables.
- One of the objects of the invention disclosed herein is to solve or alleviate problems in conventional table data extraction techniques.
- One of the more specific objects of the invention disclosed herein is to provide a table join program capable of joining a plurality of table parts when one table is divided into a plurality of table parts in a digital document.
- The table join program causes one or more processors to execute a table detection function for detecting, from a digital document, a first table portion and a second table portion different from the first table portion, a determination function for determining whether the first table portion and the second table portion can be combined, and a function of combining the first table portion and the second table portion when it is determined that they can be combined.
- The first table portion is extracted from a first page, and the second table portion is extracted from a second page different from the first page.
- The table join program further causes the one or more processors to execute a function of performing a calculation based on first numerical data included in the first table portion and second numerical data included in the second table portion.
- The table join program further causes the one or more processors to execute a function of adding, to the digital document, annotation information indicating at least one of the result of the calculation based on the first numerical data and the second numerical data and the content of the calculation.
- When the first header information of the first table portion and the second header information of the second table portion are the same, it is determined that the first table portion and the second table portion can be combined.
- When there is no text between the first table portion and the second table portion in the digital document, it is determined that the first table portion and the second table portion can be combined.
- the digital document is an unstructured document.
- The annotation addition program causes one or more processors to execute a function of extracting a numerical value from an unstructured document, a function of executing a calculation based on the extracted numerical value, and a function of adding, to the digital document, annotation information indicating at least one of the result of the calculation and the content of the calculation.
- The annotation addition method according to one or more embodiments of the present invention is performed by one or more computer processors executing computer-readable instructions.
- The annotation addition method includes a step of extracting a numerical value from an unstructured document, a step of executing a calculation based on the extracted numerical value, and a step of adding, to a digital document, annotation information indicating at least one of the result of the calculation and the content of the calculation.
- The table partitioning program causes one or more processors to execute a function of analyzing patterns of cells constituting a table contained in a digital document and, when a repeating pattern appears in any column or row of the table, dividing the table into sub-tables for each repetition.
- The table partitioning method according to one or more embodiments of the present invention is performed by one or more computer processors executing computer-readable instructions.
- The table partitioning method includes a step of analyzing a pattern of cells constituting a table included in a digital document, and a step of dividing the table into sub-tables for each repetition when a repeating pattern appears in any column or row of the table.
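The division by repeating pattern described above can be illustrated with a minimal sketch. It assumes a table is already available as a list of rows of strings and that the repetition is looked for in a single key column; the function names `smallest_period` and `split_on_repetition` and the sample values are hypothetical and do not come from the publication.

```python
from typing import List, Optional

Row = List[str]


def smallest_period(values: List[str]) -> Optional[int]:
    """Return the smallest p (at most half the length) such that values repeat every p items."""
    n = len(values)
    for p in range(1, n // 2 + 1):
        if n % p == 0 and all(values[i] == values[i % p] for i in range(n)):
            return p
    return None


def split_on_repetition(rows: List[Row], key_col: int = 0) -> List[List[Row]]:
    """Split rows into sub-tables, one per repetition of the pattern in key_col."""
    period = smallest_period([row[key_col] for row in rows])
    if period is None:
        return [rows]  # no repeating pattern: keep the table as it is
    return [rows[i:i + period] for i in range(0, len(rows), period)]


# Hypothetical table whose first column repeats "Sales, Costs, Profit" twice.
table = [
    ["Sales", "100"],
    ["Costs", "60"],
    ["Profit", "40"],
    ["Sales", "120"],
    ["Costs", "70"],
    ["Profit", "50"],
]
sub_tables = split_on_repetition(table)
```

Here the table would be divided into two sub-tables of three rows each. A real implementation would probably need to tolerate OCR noise and partial repetitions, which this exact-match sketch does not.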
- the table join system comprises one or more processors.
- By executing computer-readable instructions, the one or more processors detect, from the digital document, a first table portion and a second table portion different from the first table portion, determine whether the first table portion and the second table portion can be combined, and combine the first table portion and the second table portion when it is determined that they can be combined.
- The table division method includes a step of estimating whether each cell constituting a table included in a digital document corresponds to a header element or a table data element, and a step of dividing the table into two sub-tables at the boundary between the first data column and the second header column when the table includes, in this order, a first header column containing only cells corresponding to the header element, a first data column containing only cells corresponding to the table data element, a second header column containing only cells corresponding to the header element, and a second data column containing only cells corresponding to the table data element.
- the table join method is executed by one or more computer processors executing computer-readable instructions.
- The table joining method includes a step of detecting, from a digital document, a first table portion and a second table portion different from the first table portion, a step of determining whether the first table portion and the second table portion can be joined, and a step of joining the first table portion and the second table portion when it is determined that they can be joined.
- The user apparatus comprises one or more processors. By executing computer-readable instructions, the one or more processors upload a digital document to a server, acquire from the server an annotated document to which annotation information indicating at least one of the result and the content of a calculation based on a numerical value extracted from the digital document has been added, and display the annotated document.
- The table splitting program causes one or more processors to execute a function of estimating whether each cell constituting a table included in a digital document corresponds to a header element or a table data element, and a function of dividing the table into two sub-tables at the boundary between the first data column and the second header column when the table includes, in this order, a first header column containing only cells corresponding to the header element, a first data column containing only cells corresponding to the table data element, a second header column containing only cells corresponding to the header element, and a second data column containing only cells corresponding to the table data element.
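The header/data column split can be sketched as follows. This is only an illustration under stated assumptions: a cell is treated as a table data element when it looks numeric (the publication does not say how the estimation is done), the input contains body rows only, and the names `is_data_cell`, `column_kinds`, and `split_at_second_header` are hypothetical.

```python
import re
from typing import List, Optional, Tuple

Table = List[List[str]]

# Assumption: numeric-looking text (digits, commas, optional sign, parentheses, %) is table data.
NUMERIC = re.compile(r"^\(?-?[\d,]+(\.\d+)?\)?%?$")


def is_data_cell(text: str) -> bool:
    """Estimate whether a cell is a table data element (True) or a header element (False)."""
    return bool(NUMERIC.match(text.strip()))


def column_kinds(rows: Table) -> List[str]:
    """Label each column 'H' (only header cells), 'D' (only data cells), or 'M' (mixed)."""
    kinds = []
    for column in zip(*rows):
        flags = [is_data_cell(cell) for cell in column]
        if all(flags):
            kinds.append("D")
        elif not any(flags):
            kinds.append("H")
        else:
            kinds.append("M")
    return kinds


def split_at_second_header(rows: Table) -> Optional[Tuple[Table, Table]]:
    """Split where a data column is followed by a second header column and another data column."""
    kinds = column_kinds(rows)
    for i in range(2, len(kinds) - 1):
        if kinds[0] == "H" and kinds[i - 1] == "D" and kinds[i] == "H" and kinds[i + 1] == "D":
            return [row[:i] for row in rows], [row[i:] for row in rows]
    return None


# Hypothetical side-by-side table: header, data, header, data.
rows = [
    ["Cash", "1,200", "Loans", "800"],
    ["Inventory", "300", "Payables", "450"],
]
result = split_at_second_header(rows)
```

With the sample rows, the boundary falls between the second and third columns, so the two sub-tables each keep one header column and one data column.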
- The method for creating an electronically annotated unstructured document includes a step of acquiring the unstructured document, a step of detecting a first table portion and a second table portion from the unstructured document, a step of executing a calculation based on the first numerical data and the second numerical data, and a step of generating an annotated unstructured document by adding, to the unstructured document, annotation information indicating at least one of the result of the calculation and the content of the calculation.
- Provided is a table joining program capable of joining a plurality of table parts when one table is divided into a plurality of table parts in a digital document.
- FIG. 3 is an explanatory diagram showing two tables included in a digital document processed by the table-joining system of FIG.
- the table coupling system 1 includes a user device 10 and a server 20.
- the table coupling system 1 may include storage 30.
- the user device 10, the server 20, and the storage 30 are communicably connected to each other via the network 40.
- the network 40 may be a single network, or may be configured by connecting a plurality of networks.
- The network 40 is, for example, the Internet, a mobile communication network, or a combination thereof. Any network that enables communication between electronic devices can be used as the network 40.
- the table coupling system 1 shown in FIG. 1 is an example of a system to which the present invention can be applied, and the system to which the present invention can be applied is not limited to the system shown in FIG.
- the table coupling system 1 to which the present invention can be applied does not have to include some of the components shown.
- the table join system 1 does not have to include the storage 30.
- the table coupling system 1 may include components (not shown).
- the table coupling system 1 can include any number of two or more user devices 10.
- The illustrated table-joining system 1 includes a user device 10 and a server 20 connected to a network 40, but either the user device 10 or the server 20, each of which is a subcombination of the table-joining system 1, may be claimed on its own.
- The table-joining system described in the claims may include the configurations and functions of both the user device 10 and the server 20 as constituent requirements, or may include the configuration or function of only one of the user device 10 and the server 20 as a constituent requirement, without requiring the configuration or function of the other.
- When the table-joining system described in the claims includes the configurations and functions of both the user device 10 and the server 20 as constituent requirements, the table-joining system 1 including the user device 10 and the server 20 is claimed.
- When the table-joining system described in the claims includes only the configuration and function of the user device 10 as a constituent requirement, the illustrated user device 10 is the invention relating to the table-joining system described in the claims.
- Likewise, when the table-joining system described in the claims includes only the configuration and function of the server 20 as a constituent requirement, the illustrated server 20 is the invention relating to the table-joining system described in the claims.
- The table join system 1 identifies a table included in a digital document or a table portion that is a part thereof. In a digital document, if one table is too wide or too long to fit on one page, the table may be divided into multiple table parts and displayed.
- The table join system 1 can join the plurality of table parts into which one table has been divided.
- a digital document suitable for handling by the table-joining system 1 is, for example, an unstructured document having no structure definition.
- the unstructured document is, for example, a document in PDF format.
- the user device 10 is a personal computer (PC), a tablet terminal, a smartphone, or various information processing devices other than these.
- the user apparatus 10 includes a processor 11, a memory 12, a user interface 13, a communication interface 14, and a storage 15.
- the processor 11 is an arithmetic unit that loads an operating system and various other programs from the storage 15 or other storage into the memory 12 and executes instructions included in the loaded program.
- the processor 11 is, for example, a CPU, an MPU, a DSP, a GPU, various arithmetic units other than these, or a combination thereof.
- the processor 11 may be realized by an integrated circuit such as an ASIC, PLD, FPGA, MCU or the like.
- the memory 12 is used to store instructions executed by the processor 11 and various other data.
- the memory 12 is a main storage device (main memory) that the processor 11 can access at high speed.
- the memory 12 is composed of, for example, a RAM such as a DRAM or an SRAM.
- the user interface 13 includes an input interface that accepts user input and an output interface that outputs various information under the control of the processor 11.
- The input interface is, for example, a keyboard, a pointing device such as a mouse, a touch panel, or any other information input device capable of accepting input from a user.
- The output interface is, for example, a liquid crystal display, a display panel, or any other information output device capable of outputting the calculation result of the processor 11.
- the communication interface 14 is implemented as hardware, firmware, communication software such as a TCP / IP driver or PPP driver, or a combination thereof.
- the user device 10 can send and receive data to and from other devices such as the server 20 via the communication interface 14.
- the storage 15 is an external storage device accessed by the processor 11.
- The storage 15 is, for example, a magnetic disk, an optical disk, a semiconductor memory, or any other storage device capable of storing data.
- the storage 15 may store the digital document 15a.
- the digital document 15a may be, for example, an unstructured document such as a PDF file.
- the processor 11 of the user device 10 functions as an upload unit 11a and a display unit 11b.
- the upload unit 11a uploads the digital document to the server 20.
- the upload unit 11a can read the digital document 15a from the storage 15 and upload the read digital document 15a to the server 20.
- the display unit 11b displays a digital document on the display.
- The display unit 11b may display, for example, the digital document 15a read from the storage 15 on the display.
- the display unit 11b may receive a digital document from the server 20 and display the digital document received from the server 20 on the display.
- the display unit 11b may receive the annotated document 25c described later from the server 20 and display it on the display.
- the server 20 includes a processor 21, a memory 22, a user interface 23, a communication interface 24, and a storage 25.
- the processor 21 is an arithmetic unit that loads various programs for providing an operating system and an application into the memory 22 and executes instructions included in the loaded programs.
- the description of the processor 11 also applies to the processor 21, and the description of the memory 12 also applies to the memory 22.
- the user interface 23 includes an input interface that accepts the input of the operator of the server 20 and an output interface that outputs various information under the control of the processor 21.
- the communication interface 24 is implemented as hardware, firmware, communication software such as a TCP / IP driver or PPP driver, or a combination thereof.
- the server 20 can send and receive data to and from other devices via the communication interface 24.
- the storage 25 is an external storage device accessed by the processor 21.
- The storage 25 is, for example, a magnetic disk, an optical disk, a semiconductor memory, or any other storage device capable of storing data.
- The storage 25 stores the table join program 25a for extracting table data from a table or a table part included in a digital document and joining table parts based on the extracted table data, the original document 25b, which is a digital document to be analyzed, and the annotated document 25c, which includes annotations added by executing the table join program.
- the instructions included in the table join program 25a may be executed by the processor 21. Details of the functions realized by executing the table join program 25a will be described later.
- the original document 25b is, for example, a digital document uploaded from the user device 10.
- the original document 25b can include one or more tables.
- the original document 25b may be an unstructured document such as a PDF file that does not have a structure definition.
- An unstructured document includes objects such as text and images contained in each page constituting the document, and coordinate information indicating the arrangement of the objects in the page, but does not include information indicating the structure of the document.
- the annotated document 25c is a document in which annotation information indicating the content and result of calculation of numerical data included in the original document 25b is added to the original document 25b.
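To make the idea of annotation information concrete, the following sketch computes one value from the numerical data of two table portions and records both the content and the result of the calculation. The publication does not specify which calculation is performed; the use of a simple sum, the accounting-style parsing of parenthesized negatives, and the names `parse_amount` and `annotate_sum` are all assumptions made for illustration.

```python
from typing import Dict, List, Union


def parse_amount(text: str) -> int:
    """Parse '1,200' or '(300)' (assumed accounting-style negative) into an int."""
    cleaned = text.strip().replace(",", "")
    if cleaned.startswith("(") and cleaned.endswith(")"):
        return -int(cleaned[1:-1])
    return int(cleaned)


def annotate_sum(first_values: List[str], second_values: List[str]) -> Dict[str, Union[str, int]]:
    """Sum numerical data from two table portions and describe the calculation."""
    numbers = [parse_amount(v) for v in first_values + second_values]
    return {
        "content": " + ".join(str(n) for n in numbers),  # what was calculated
        "result": sum(numbers),                           # outcome of the calculation
    }


# Hypothetical numerical data taken from a first and a second table portion.
annotation = annotate_sum(["1,200", "(300)"], ["450"])
```

The resulting dictionary could then be attached to the original document as an annotation object; how that attachment is stored is outside the scope of this sketch.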
- annotation information may be stored in the storage 25 as an object of the original document 25b.
- various data that can be stored in the storage 15 may be stored in a storage (for example, storage 25 or storage 30) or a database server that is physically separate from the user device 10.
- various data that can be stored in the storage 25 may be stored in a storage (for example, storage 15 or storage 30) or a database server that is physically separate from the server 20.
- The storage 15 and the storage 25 are each shown as a single unit, but at least one of the storages 15 and 25 may be a collection of a plurality of physically separate storages.
- the data stored in the storage 15 and the data stored in the storage 25 may be stored in a single storage or may be distributed and stored in a plurality of storages.
- the term “storage” may refer to either a single storage or a collection of multiple storages, as is permitted in the context.
- By executing the instructions included in the table join program 25a or other instructions, the processor 21 of the server 20 functions as the table detection unit 21a, the table data extraction unit 21b, the determination unit 21c, the connection unit 21d, the calculation unit 21e, and the annotation addition unit 21f.
- the table detection unit 21a detects the table included in the digital document to be analyzed.
- the table detection unit 21a can perform image processing such as rectangle detection processing on the digital document to be analyzed, and detect the rectangular elements included in the digital document as a table.
- the table detection unit 21a can detect a table from a digital document by any known method other than the rectangle detection process.
- the digital document to be analyzed is, for example, the original document 25b stored in the storage 25.
- the original document 25b may include a table divided into a plurality of table portions. For example, in the original document 25b, a table that is longer in the horizontal direction than the width of the page may be divided into a plurality of table portions.
- When a table is too long in the vertical direction to be displayed on one page, the table may be divided into a plurality of table portions and arranged across a plurality of pages.
- the table detection unit 21a detects each of the divided plurality of table portions as one table. In other words, the table detection unit 21a detects the table contained in the digital document without distinguishing whether the table contained in the digital document is the whole table or a part of the table. For example, when one table included in a digital document is divided into two parts, a first table part and a second table part, each of the first table part and the second table part is detected as a table.
- When the digital document contains a plurality of pages, the table detection unit 21a performs a process of detecting a table in each of the plurality of pages.
- The term "table" as used herein is used in a general sense. That is, a "table" in the present specification is a representation in which data such as characters and numbers are described in cells separated by ruled lines. In the table, a part of the ruled lines separating the cells may be omitted, and the cells may be separated by blank white space. A cell in which one of the top, bottom, left, and right sides is separated by a space in this way may be called a whitespace-delimited cell.
- the table data extraction unit 21b recognizes the layout of each table detected by the table detection unit 21a, and also extracts table data arranged in a plurality of cells constituting the table.
- the table data is characters, numbers, or other data arranged in each cell.
- the table data extraction unit 21b may store the extracted table data in the storage 25 for each table.
- the table data of each table may be stored in the storage 25 in association with the table identification information that identifies each table.
- the table data extraction unit 21b can extract each table data of the table detected by the table detection unit 21a by an image-based method, a text-based method, or a known method other than these.
- the text-based method is a method of converting a PDF file into a text file and extracting table data based on the text file.
- In the image-based method, the table data extraction unit 21b converts the original document 25b into an image file, and recognizes the table layout and extracts the table data using this image file.
- The table data extraction unit 21b performs rectangle detection processing on the image file converted from the digital document, detects rectangular graphic elements included in the table, and recognizes the detected rectangular graphic elements as cells.
- The table data extraction unit 21b detects the coordinate information, width, and height of each detected cell, and recognizes cells belonging to the same column and the same row based on the coordinate information. The cells recognized in this way are surrounded by ruled lines. Further, the table data extraction unit 21b detects the table data arranged in each cell by, for example, OCR (Optical Character Recognition / Reader).
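Recognizing which cells share a row or column from their coordinate information can be sketched as below. The sketch assumes the rectangle detection has already produced each cell's top-left coordinates, width, and height; the `Cell` class, the `group_by_axis` helper, and the tolerance value are hypothetical choices, not details taken from the publication.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Cell:
    x: float        # left edge in page coordinates
    y: float        # top edge in page coordinates
    width: float
    height: float
    text: str = ""  # table data, e.g. filled in later by OCR


def group_by_axis(cells: List[Cell], attr: str, tol: float = 2.0) -> List[List[Cell]]:
    """Group cells whose coordinate `attr` ('x' or 'y') agrees within `tol`."""
    groups: List[List[Cell]] = []
    for cell in sorted(cells, key=lambda c: getattr(c, attr)):
        if groups and abs(getattr(groups[-1][0], attr) - getattr(cell, attr)) <= tol:
            groups[-1].append(cell)   # same row (attr='y') or same column (attr='x')
        else:
            groups.append([cell])
    return groups


# Hypothetical detected cells; the second header cell is slightly misaligned.
cells = [
    Cell(0, 0, 50, 10, "Item"),
    Cell(50, 0.5, 50, 10, "Amount"),
    Cell(0, 10, 50, 10, "Cash"),
    Cell(50, 10, 50, 10, "100"),
]
rows = group_by_axis(cells, "y")
columns = group_by_axis(cells, "x")
grid = [[c.text for c in sorted(row, key=lambda c: c.x)] for row in rows]
```

The tolerance absorbs small detection jitter, so the slightly offset "Amount" cell still lands in the header row.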
- The table data extraction unit 21b can determine whether or not a cell separated by ruled lines includes blank-separated cells. For example, if a cell separated by ruled lines contains text divided into multiple rows or columns, it can be determined that there is a blank separator between the rows and/or between the columns. In this case, the table data extraction unit 21b can recognize a plurality of blank-separated cells within the cell separated by ruled lines, and can detect the table data contained in the blank-separated cells.
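One simple way to picture blank-separated cells is to split the text recognized inside a ruled cell on line breaks and on wide gaps. This is a sketch under assumptions: the OCR output is taken to preserve line breaks, and a run of two or more whitespace characters is taken to mark a column gap. The function name `split_blank_separated` is hypothetical.

```python
import re
from typing import List


def split_blank_separated(cell_text: str) -> List[List[str]]:
    """Split the text of one ruled cell into blank-separated sub-cells."""
    lines = [line for line in cell_text.splitlines() if line.strip()]
    # Assumption: two or more consecutive whitespace characters separate columns.
    return [re.split(r"\s{2,}", line.strip()) for line in lines]


# Hypothetical OCR text found inside a single ruled cell.
sub_cells = split_blank_separated("Sales   100\nCosts   60")
```

An implementation working from coordinates rather than raw text would compare gaps between text boxes instead of counting spaces.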
- In the text-based method, the table data extraction unit 21b converts the document data into a text file in which each piece of text is given its coordinates in the page. The layout of the table is then recognized by analyzing the positional relationship between the texts based on this coordinate information.
- the table data extraction unit 21b can perform both an image-based method for extracting table data and a text-based method for extracting table data.
- the image-based method of extracting table data and the text-based method of extracting table data may be executed in parallel, or one of them may be executed sequentially before the other.
- the determination unit 21c determines whether or not the plurality of tables detected by the table detection unit 21a can be combined with other tables. When a table that is originally supposed to be one table is divided into a plurality of table parts due to page layout or the like, the plurality of table parts can be combined.
- The determination unit 21c can determine whether or not one table detected by the table detection unit 21a can be combined with another table based on the table data extracted by the table data extraction unit 21b. Specifically, the determination unit 21c can determine that two tables detected by the table detection unit 21a can be combined when, for example, the header information included in the table data of the two tables is the same.
- the determination unit 21c determines whether or not text data other than the table data is included between the two tables for which the header information is determined to be the same, and the text is included between the two tables. If not, it may be determined that the two tables can be combined. If there is no text between the two tables, each of the two tables is presumed to be a divided table portion of one table. Therefore, the determination unit 21c can determine that the two tables can be combined when there is no text between the two tables and the header information is the same between the two tables. Two tables may be spread across two different pages.
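The determination described above, identical header information and no text between the two tables, could be expressed roughly as follows. The `DetectedTable` structure, the `can_join` function, and the sample values are hypothetical; in particular, how the text between two tables is collected is assumed to be handled elsewhere.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DetectedTable:
    page: int                # page on which the table (portion) was detected
    header: List[str]        # header information extracted from the table data
    rows: List[List[str]]    # remaining table data


def can_join(first: DetectedTable, second: DetectedTable, text_between: List[str]) -> bool:
    """Return True when the headers are identical and no text lies between the tables."""
    same_header = first.header == second.header
    no_text = all(not t.strip() for t in text_between)
    return same_header and no_text


# Hypothetical table portions spread across two pages.
part_a = DetectedTable(page=10, header=["Item", "Amount"], rows=[["Cash", "100"]])
part_b = DetectedTable(page=11, header=["Item", "Amount"], rows=[["Loans", "80"]])
joinable = can_join(part_a, part_b, text_between=[])
```

Whitespace-only strings are treated as "no text" here, which is a design choice of the sketch rather than something stated in the source.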
- Between the two tables, information that can be ignored as content of the document may be arranged, such as a page number, text outside the page description area (e.g., header, footer, footnote, document reference symbol), a table legend (e.g., "(unit: 1 million yen)" in an accounting document), or a table caption. Even if information that is not related to the contents of the document is placed between the two tables, if the two tables are displayed across consecutive pages (for example, one of the two tables is arranged on page 10 and the other is arranged on page 11), it is likely that the two tables were created with the intention of being one table.
- The determination unit 21c may determine that the two tables can be combined when the two tables are arranged on consecutive pages and the header information of the two tables is the same. Also, text that can be ignored may be registered in advance as predefined text; if only predefined text is placed between the two tables and the header information of the two tables is the same, it may be determined that the two tables can be combined.
- FIG. 2 shows Table T11 and Table T12.
- Table T11 and Table T12 are arranged, for example, on the same page of a digital document stored as the original document 25b.
- both Table T11 and Table T12 are table parts that form part of one statement of changes in shareholders' equity.
- The row header information T11a in table T11 has the texts "balance at the beginning of the current period", "variable amount for the current period" ... "balance at the end of the current period", and the row header information T12a in table T12 has the same texts. Therefore, the row header information T11a in table T11 and the row header information T12a in table T12 match.
- The determination unit 21c can therefore determine that table T11 and table T12 can be combined. For example, when one table cannot be displayed within the width of one page due to the limitation of the page width, the table may be divided into two tables and included in the digital document, as with table T11 and table T12. In this way, two tables into which one table has been divided in the horizontal direction can be joined based on the row header information of each of the two tables (horizontal join).
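A horizontal join based on matching row header information might be sketched as follows. The sketch assumes each table portion is a list of rows whose first element is the row header; the function name `join_horizontally` and the numeric values are hypothetical, while the row header texts follow the example of tables T11 and T12.

```python
from typing import List

Table = List[List[str]]


def join_horizontally(left: Table, right: Table) -> Table:
    """Join two table portions side by side when their row headers match."""
    left_headers = [row[0] for row in left]
    right_headers = [row[0] for row in right]
    if left_headers != right_headers:
        raise ValueError("row header information differs; the portions are not joined")
    # Keep the row header once and append the data columns of the right portion.
    return [l_row + r_row[1:] for l_row, r_row in zip(left, right)]


# Row headers as in the T11/T12 example; the amounts are made up for illustration.
t11 = [
    ["balance at the beginning of the current period", "1,000"],
    ["balance at the end of the current period", "1,200"],
]
t12 = [
    ["balance at the beginning of the current period", "500"],
    ["balance at the end of the current period", "650"],
]
joined = join_horizontally(t11, t12)
```

Raising an error on a mismatch is only one option; a caller could equally check joinability first and skip the join.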
- Table T21 and table T22 are shown in FIG. 3. Both table T21 and table T22 show a portion of the consolidated cash flow statement.
- The column header information T21a of table T21 has, in order from the left, the texts "previous consolidated fiscal year (April 1, 2017 to March 31, 2018)" and "current consolidated fiscal year (April 1, 2018 to March 31, 2019)".
- The column header information T22a of table T22 likewise has, in order from the left, the texts "previous consolidated fiscal year (April 1, 2017 to March 31, 2018)" and "current consolidated fiscal year (April 1, 2018 to March 31, 2019)".
- The column header information T21a in table T21 and the column header information T22a in table T22 match. Between table T21 and table T22, the text "-48-" indicating the page number and the text "unit: 1,000 yen", which is a legend of the consolidated cash flow statement, are arranged. By registering page numbers and frequently appearing table legends such as "unit: 1,000 yen" as predefined text, the determination unit 21c can determine that table T21 and table T22 can be combined. For this purpose, the determination unit 21c has a list of text patterns used to classify text as predefined text.
- For example, the text "-48-" matches the text pattern of a number sandwiched between "-" characters and can be classified into the category "page number".
- Each category of predefined text may be weighted. For example, the importance of the categories "page number”, “document reference symbol”, and “table legend” may be "non-important".
- The determination unit 21c determines whether all the text between the two tables belongs to a "non-important" category and, if so, can determine that the two tables can be combined. In another embodiment, even if no predefined text is defined, the determination unit 21c can determine that table T21 and table T22 can be combined because they have common header information and are arranged on consecutive pages (pages 48 and 49).
- One table may be divided into two tables and included in a digital document, as with table T21 and table T22.
- Based on the column header information of each of two tables obtained by dividing one table in the vertical direction, the determination unit 21c can determine that the two divided tables in the digital document can be joined vertically.
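- The determination described above can be illustrated with a minimal Python sketch. It is not the implementation of the determination unit 21c: the function names `classify` and `can_combine`, the concrete regular expressions, and the argument layout are assumptions made for illustration. Only the page-number pattern (a number sandwiched between "-"), the legend "unit: 1,000 yen", and the three "non-important" categories come from the description.

```python
import re

# Hypothetical pattern list. Only the page-number shape and the
# "unit: 1,000 yen" legend are taken from the description.
PREDEFINED_PATTERNS = {
    "page number": re.compile(r"^-\s*\d+\s*-$"),
    "table legend": re.compile(r"^\(?unit:\s*[\d,]+\s*yen\)?$", re.IGNORECASE),
}
NON_IMPORTANT = {"page number", "document reference symbol", "table legend"}


def classify(text):
    """Return the category of a text fragment, or None if it is not predefined."""
    for category, pattern in PREDEFINED_PATTERNS.items():
        if pattern.match(text.strip()):
            return category
    return None


def can_combine(header_a, header_b, texts_between, page_a=None, page_b=None):
    """Decide whether two table portions can be combined."""
    if header_a != header_b:
        return False
    # No text at all, or only non-important predefined text, between the tables.
    if all(classify(t) in NON_IMPORTANT for t in texts_between):
        return True
    # Alternative embodiment: common header information on consecutive pages.
    return page_a is not None and page_b is not None and abs(page_a - page_b) == 1
```

- With the texts of FIG. 3, `can_combine(headers, headers, ["-48-", "unit: 1,000 yen"])` evaluates to `True` in this sketch, because both intervening texts fall into non-important categories.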
- The joining unit 21d joins two tables determined to be joinable by the determination unit 21c. For example, when table T11 and table T12 shown in FIG. 2 are determined by the determination unit 21c to be joinable, the joining unit 21d joins table T11 and table T12. Joining table T11 and table T12 generates a table different from the tables detected by the table detection unit 21a. In the present specification, a table obtained by joining two tables included in a digital document stored as the original document 25b may be referred to as a "join table" or "combined table".
- the join unit 21d can store the data contained in the two tables before the join in the storage 25 in association with the table identification information for identifying the join table obtained by joining the two tables.
- The joining unit 21d joins two tables divided in the horizontal direction by a horizontal join.
- In a horizontally joined join table, the row data of the tables before the join is integrated.
- For example, the data contained in the row whose row header is "balance at the beginning of the current period" in table T11 before the join and the data contained in the row whose row header is "balance at the beginning of the current period" in table T12 before the join both become data of the row whose row header is "balance at the beginning of the current period" in the horizontally joined join table.
- The joining unit 21d joins two tables divided in the vertical direction by a vertical join.
- In a vertically joined join table, the column data of the tables before the join is integrated.
- For example, the data contained in the column whose column header is "previous consolidated fiscal year (April 1, 2017 to March 31, 2018)" in table T21 before the join and the data contained in the column with the same column header in table T22 before the join both become data of the column whose column header is "previous consolidated fiscal year (April 1, 2017 to March 31, 2018)" in the vertically joined join table.
- the joining portion 21d may join three or more tables.
- The criterion for joining three or more tables may be the same as the criterion for joining two tables. For example, if the row header information of three tables matches and there is no text, or only predefined text, between the three tables, the three tables can be joined into one join table. Three or more tables may also be joined in stages. For example, when joining a first table, a second table, and a third table, the first table and the second table are first joined to create an intermediate join table, and this intermediate join table and the third table are then joined to generate the final join table.
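- The horizontal join, the vertical join, and the staged join of three or more tables can be sketched as follows. This is a simplified illustration, not the joining unit 21d itself: the dictionary layout (`row_headers`, `col_headers`, `rows`) and the function names are assumptions introduced here.

```python
from functools import reduce


def join_horizontally(left, right):
    """Join two table portions that share row headers (as with T11 and T12)."""
    if left["row_headers"] != right["row_headers"]:
        raise ValueError("row header information does not match")
    return {
        "row_headers": list(left["row_headers"]),
        "col_headers": left["col_headers"] + right["col_headers"],
        # Row data of both portions is integrated row by row.
        "rows": [l + r for l, r in zip(left["rows"], right["rows"])],
    }


def join_vertically(upper, lower):
    """Join two table portions that share column headers (as with T21 and T22)."""
    if upper["col_headers"] != lower["col_headers"]:
        raise ValueError("column header information does not match")
    return {
        "row_headers": upper["row_headers"] + lower["row_headers"],
        "col_headers": list(upper["col_headers"]),
        # Column data is integrated by stacking the rows of both portions.
        "rows": upper["rows"] + lower["rows"],
    }


def join_in_stages(tables, join):
    """Join three or more portions pairwise: ((first + second) + third) and so on."""
    return reduce(join, tables)
```

- In `join_in_stages`, each intermediate result plays the role of the intermediate join table, and the last result is the final join table.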
- The calculation unit 21e can perform calculations based on the numerical values included in the table data extracted by the table data extraction unit 21b and/or the numerical values included in the table data of a join table, which is stored in the storage 25 in association with the table identification information of the join table obtained by joining two tables by the joining unit 21d. For example, if the digital document contains a balance sheet, the amounts of assets contained in the assets section of the balance sheet are included in the table data of the balance sheet. The calculation unit 21e can calculate the total amount of assets by, for example, adding all the amounts of assets included in the table data of the balance sheet. The calculation unit 21e can also execute calculations for various subsets of the data stored as the table data. For example, the total current assets can be calculated by summing the numerical values of the current asset items included in the table data of the balance sheet.
- The calculation unit 21e can also perform calculations based on the table data of a join table. For example, if a balance sheet is divided across two pages in a digital document and the table portions that make up the balance sheet are not joined, the figures contained in the balance sheet cannot be calculated correctly from the table data. Suppose the table portion on the preceding page contains all of the current assets and part of the fixed assets, and the table portion on the following page contains the rest of the fixed assets. Unless the two portions of this balance sheet are joined, the total assets, which is the sum of current assets and fixed assets, cannot be calculated.
- The joining unit 21d joins two tables that were originally one table (two table portions into which the balance sheet is divided), as in the case of the two-part balance sheet. Therefore, even if the digital document contains a divided table, calculations can be performed correctly based on the table data of the join table. Even if the digital document contains a balance sheet divided across two pages, according to the embodiment of the present invention, the amount of assets, the amount of liabilities, and other figures calculated from the numerical values contained in the balance sheet can be calculated correctly.
- the annotation addition unit 21f can add annotation information indicating the result and / or content of the calculation performed by the calculation unit 21e based on the table data of the table to the digital document.
- The object corresponding to the annotation information is added in or near the table included in the digital document.
- the object corresponding to the annotation information displayed together with the table of the digital document may be referred to simply as the annotation information.
- FIG. 4 shows a portion of the consolidated balance sheet T31 contained in the digital document. Annotation information is added to the consolidated balance sheet T31 of FIG. 4.
- The annotation information includes arrows A1 to A10 indicating calculation ranges, a mismatch mark S1 indicating that the calculation result of the calculation unit 21e does not match the corresponding data in the table data, and a match mark S2 indicating that the calculation result of the calculation unit 21e matches the corresponding data in the table data.
- Arrows A1 to A10, which indicate calculation ranges, each extend from a start point to an end point. Each arrow shows that the numerical values of the cells between its start point and end point are the calculation targets and that the calculation result is displayed in the cell at its end point.
- For example, arrow A1 extends from the start cell of the current assets item (the "Cash and deposits" cell) to the end cell of the current assets item (the "Total current assets" cell) among the cells contained in the same column as the column header cell containing the text "previous consolidated fiscal year (March 31, 2018)".
- Arrow A1 shows that, in the column of "previous consolidated fiscal year (March 31, 2018)", the values of "cash and deposits", "notes and accounts receivable", "inventories", "accounts receivable", "other", and "allowance for doubtful accounts" between the start point and end point of arrow A1 are added, and that the result of this addition is displayed in the cell of the "Total current assets" row corresponding to the end point of arrow A1. However, because the numbers in the "allowance for doubtful accounts" row are marked with "△", they are converted to negative values before being added (that is, they are subtracted).
- The result calculated by the calculation unit 21e over the calculation range indicated by arrow A1 is "38,545,156", which matches the value displayed in the cell in the "Total current assets" row of the consolidated balance sheet T31.
- Therefore, a match mark S2 is attached to this cell to indicate that the calculation result of the calculation unit 21e matches the numerical value in the table included in the original document 25b to be analyzed.
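- The addition over a calculation range, including the sign conversion for marked amounts, can be sketched as follows. The sketch assumes that a negative amount is written with a leading "△" (a convention common in Japanese financial statements) and that amounts use commas as thousands separators; the function names and the sample values in the usage note are illustrative and are not taken from FIG. 4.

```python
NEGATIVE_MARK = "△"  # assumption: leading mark that denotes a negative amount


def parse_amount(cell):
    """Convert a cell such as '1,234' or '△56' into a signed integer."""
    text = cell.strip().replace(",", "")
    if text.startswith(NEGATIVE_MARK):
        # Marked amounts are converted to negative values, i.e. subtracted.
        return -int(text[len(NEGATIVE_MARK):])
    return int(text)


def sum_range(cells):
    """Add the values of all cells between the start and end of an arrow."""
    return sum(parse_amount(cell) for cell in cells)
```

- For instance, `sum_range(["10,000", "2,500", "△1,200"])` adds 10,000 and 2,500 and subtracts 1,200.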
- Arrows A2 to A10 also indicate the calculation range in the same way as arrow A1.
- Arrow A2 extends from the start cell of the current assets item (the "Cash and deposits" cell) to the end cell of the current assets item (the "Total current assets" cell) among the cells contained in the same column as the column header cell containing the text "Current consolidated fiscal year (March 31, 2019)". As with arrow A1, it shows that the items between the start point and end point of the arrow ("cash and deposits", "notes receivable and ...", and so on) are the calculation targets.
- In this case, a mismatch mark S1 is attached to indicate that the calculation result of the calculation unit 21e does not match the numerical value in the table included in the original document 25b to be analyzed.
- Together with the mismatch mark S1, "-1" is displayed in small characters as difference information indicating the difference value.
- In the case of a mismatch, the calculation result of the calculation unit 21e ("46,398,833" in the above example) may also be displayed.
- the column of fixed assets includes an item showing a subtotal that is the sum of some of the items included in fixed assets.
- the arrow of the annotation information can also indicate the calculation range and the calculation result of such a subtotal.
- For example, "buildings and structures (net amount)" is a subtotal of the value of "buildings and structures" and the value of "accumulated depreciation" among the items included in fixed assets.
- Arrow A3 indicates the calculation range when calculating this "building and structure (net amount)".
- The result calculated by the calculation unit 21e over the calculation range indicated by arrow A3 is "2,002,570", which is larger by "1" than the value "2,002,569" displayed in the cell in the "Buildings and structures (net amount)" row of the consolidated balance sheet T31. Therefore, in the cell of the "Buildings and structures (net amount)" row in the column of "Previous consolidated fiscal year (March 31, 2018)", a mismatch mark S1 is attached to indicate that the calculation result of the calculation unit 21e does not match the numerical value in the table included in the original document 25b to be analyzed. Further, to show that the correct calculation result is larger by "1" than the indicated value, the difference information "1" is displayed in small characters together with the mismatch mark S1.
- Arrow A10 indicates the calculation range for calculating "total tangible fixed assets” among fixed assets.
- The calculation of "total tangible fixed assets" covers "buildings and structures (net amount)", "mechanical equipment and carriers (net amount)", "others (net amount)", "land", and "construction in progress". Because "buildings and structures (net amount)", "mechanical equipment and carriers (net amount)", and "others (net amount)" are subtotals of some of the fixed asset items, the range between the start point and end point of arrow A10 contains both the items used to calculate a subtotal and the items showing those subtotals.
- If every item in this range were included in the calculation, some fixed asset items would be counted twice.
- For example, "buildings and structures (net amount)" is a subtotal obtained by adding the value of "buildings and structures" and the value of "accumulated depreciation", so if "buildings and structures", "accumulated depreciation", and "buildings and structures (net amount)" in the range of arrow A10 were all included in the calculation of "total tangible fixed assets", "buildings and structures" and "accumulated depreciation" would be counted twice.
- the arrow A10 indicates the section corresponding to the item to be calculated among the items between the start point and the end point by the solid line A10a, and the item to be excluded from the calculation target is indicated by the broken line A10b.
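- The exclusion of items that are already reflected in a subtotal can be sketched as follows. The flag `counted_in_subtotal`, the function name, and the sample amounts in the usage note are assumptions made for illustration; how the rows that feed a subtotal are identified is outside the scope of this sketch.

```python
def total_with_exclusions(rows):
    """Sum a calculation range while skipping rows already counted in a subtotal.

    rows: list of (label, value, counted_in_subtotal) tuples. Rows with
    counted_in_subtotal=True correspond to the broken-line section of the
    arrow; all other rows correspond to the solid-line section.
    """
    total = 0
    styles = []
    for label, value, counted_in_subtotal in rows:
        if counted_in_subtotal:
            styles.append((label, "broken"))  # excluded from the calculation
        else:
            styles.append((label, "solid"))   # included in the calculation
            total += value
    return total, styles
```

- With illustrative rows such as buildings and structures (5,000, excluded), accumulated depreciation (-3,000, excluded), buildings and structures (net amount) (2,000), land (1,000), and construction in progress (500), only the last three rows are added, so the component rows are not counted twice.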
- the calculation unit 21e can perform the calculation based on the table data of the join table in which the two tables are joined.
- the annotation addition unit 21f can add annotation information indicating the result and / or content of the calculation calculated based on the table data of the join table to the join table.
- FIGS. 5a and 5b show the combined table CT1 (statement of changes in shareholders' equity) obtained by joining tables T11 and T12 shown in FIG. 2, together with the annotation information.
- Arrow A11 extends from the start cell (the "capital" cell) to the end cell (the "total shareholders' equity" cell) among the cells contained in the same row as the row header cell containing the text "balance at the beginning of the current period".
- Arrow A11 shows that the items between its start point and end point are added in the row of "balance at the beginning of the current period" and that the result is displayed in the cell of the "total shareholders' equity" column corresponding to the end point of arrow A11. However, because the "balance at the beginning of the current period" row also includes subtotals, some items are excluded from the calculation to prevent double counting.
- In the cells corresponding to the items excluded from the calculation, arrow A11 is shown as a dotted line. In the cells included in the calculation, arrow A11 is shown as a solid line.
- the result calculated by the calculation unit 21e in the calculation range indicated by the arrow A11 is "14,535,608", and this calculation result is the "total shareholders' equity" in the combined table CT1 shown in FIGS. 5a and 5b. Since it matches the value described in the cell, the match mark S2 is attached to the right side of the cell.
- For example, the "total net assets" at the end of the statement is the sum of the "total shareholders' equity" included in table T11 and the "total of evaluation / conversion difference, etc." included in table T12. Therefore, the calculation based on the table data of the combined table CT1 is performed across the table data of table T11 and the table data of table T12.
- Arrow A12 indicates the calculation range for calculating "total net assets". Therefore, the calculation range corresponding to arrow A12 includes the "total shareholders' equity" included in table T11.
- Because the display mode at the right end of table T11 differs between arrow A11 and arrow A12, the user can easily identify which arrow terminates there and which does not.
- For example, arrow A12 may be marked with a predetermined continuation symbol (a circle, a square, or another symbol distinguishable from the end of an arrow), and the same or a corresponding symbol may be added at the corresponding continuation position in table T12 to resume the display of arrow A12.
- Because the continuation symbol differs from the arrowhead indicating the end of an arrow, the user can easily identify whether or not the arrow continues in the following table.
- the annotation information can correctly indicate the calculation range calculated across the plurality of table portions.
- FIG. 3 shows a cash flow statement divided across two pages. If tables T21 and T22, which are parts of this cash flow statement, are joined to create a join table, the "increase in cash and cash equivalents" at the end of the cash flow statement is the sum of "cash flows from operating activities" and "cash flows from investing activities" included in table T21 and "cash flows from financing activities" included in table T22.
- The calculation based on the table data of this join table is performed across the table data arranged in table T21 and the table data arranged in table T22, and the annotation information can also correctly indicate the calculation range of a calculation performed across table T21 and table T22.
- the digital document to which the annotation information is added may be stored in the storage 25 as the annotated document 25c.
- The annotation information expressly described herein is merely an example of the annotation information applicable to the invention disclosed in the present application, and the applicable annotation information is not limited to that specifically described herein.
- the processor 21 of the server 20 can execute a function of receiving a request for sending a digital document from the user device 10.
- the delivery request may include identification information that identifies the digital document.
- the processor 21 can read the digital document from the storage 25 and transmit the read digital document to the user device 10.
- the storage 25 may store the original document 25b, which is an original digital document, and the annotated document 25c, in which annotation information is added to the original digital document.
- the transmission request from the user apparatus 10 may include document type identification information for identifying whether the transmission of the original digital document or the digital document to which the annotation information is added is requested.
- FIG. 6 shows, as an example of a table containing blank delimited cells, a table T41 containing twelve cells delimited by blanks in a region delimited by horizontal ruled lines.
- The table data extraction unit 21b extracts table data from this type of table by an image-based method. For example, the original document 25b including table T41 (or the page including table T41, which is a part of it) is converted into an image file, and the table layout is recognized using this image file.
- the table data extraction unit 21b detects a ruled line in an image file converted from a digital document, and recognizes a rectangular area between the ruled lines as a temporary cell.
- the ruled lines L1 to L4 are detected, and the area between the ruled lines is recognized as the temporary cells T41a to T41c.
- The table data extraction unit 21b determines whether or not a temporary cell between ruled lines has rows and columns separated by blanks instead of ruled lines. For example, when a temporary cell contains a plurality of texts that are laterally separated from each other, the table data extraction unit 21b can recognize that the temporary cell has columns separated by blanks instead of ruled lines. Focusing on the horizontal arrangement of the text in the temporary cell T41b, the text "buildings and structures", the text "8,994,000 yen", and the text "44,049" are arranged laterally apart from each other inside the temporary cell T41b, so it is recognized that three columns separated by blanks exist inside the temporary cell T41b.
- Focusing on the vertical arrangement, the leftmost column inside the temporary cell T41b contains the texts "buildings and structures", "mechanical devices and carriers", "other tangible fixed assets", and "software" arranged vertically apart from each other, so it is recognized that four rows separated by blanks exist inside the temporary cell T41b.
- Thus, the table data extraction unit 21b recognizes that cells separated by blanks in 4 rows × 3 columns exist inside the temporary cell T41b. Similarly, it is recognized that cells separated by blanks in 1 row × 3 columns exist inside the temporary cell T41a and inside the temporary cell T41c, respectively.
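- The recognition of rows and columns separated by blanks can be sketched as counting groups of text bounding boxes that are separated by gaps. This is a simplified illustration under stated assumptions: the text boxes are assumed to be already available as dictionaries with `x0`, `x1`, `y0`, `y1` coordinates (for example from OCR), and the threshold `min_gap` and the function names are hypothetical.

```python
def count_groups(intervals, min_gap):
    """Count groups of 1-D intervals separated by blanks of at least min_gap."""
    groups = 0
    current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start - current_end >= min_gap:
            groups += 1          # a blank precedes this interval: new group
            current_end = end
        else:
            current_end = max(current_end, end)  # same group, extend it
    return groups


def blank_delimited_shape(boxes, min_gap=5):
    """Return (rows, columns) of blank-separated cells inside a temporary cell."""
    columns = count_groups([(b["x0"], b["x1"]) for b in boxes], min_gap)
    rows = count_groups([(b["y0"], b["y1"]) for b in boxes], min_gap)
    return rows, columns
```

- Texts that overlap horizontally are treated as belonging to the same column, and texts that overlap vertically as belonging to the same row, so a temporary cell like T41b would yield a shape of 4 rows and 3 columns when its texts are laid out on such a grid.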
- the table data extraction unit 21b detects the table data arranged in the cells separated by the blanks recognized as described above by, for example, OCR.
- the calculation unit 21e can execute the calculation based on the table data of the cells separated by the blanks detected as described above.
- the calculation unit 21e can also execute the calculation using both the table data of the cells separated by the blank and the table data of the cells separated by the ruled line.
- the annotation addition unit 21f can add and display the annotation information indicating the result and / or the content of the calculation performed by the calculation unit 21e to the table T41 including the cells separated by blanks. In Table T41 of FIG. 6, an arrow indicating a calculation range and a mismatch mark are added.
- In each of the tables shown in FIGS. 7 to 9, as in FIG. 6, cells separated by blanks exist in the areas divided by ruled lines.
- The table data extraction unit 21b can recognize the cells separated by blanks in each of the tables shown in FIGS. 7 to 9 by the same method as that used to detect the cells separated by blanks in FIG. 6.
- Annotation information indicating the result and / or content of the calculation of the table data arranged in the cells separated by blanks is added to each of the tables of FIGS. 7 and 8.
- the calculation is performed in both the vertical and horizontal directions, and the table is supplemented with annotation information indicating the result and / or the content of the calculation.
- In step S11, the tables included in the digital document to be processed are detected.
- The process of step S11 is performed by, for example, the above-mentioned table detection unit 21a.
- In step S12, table data is extracted by a text-based method for each of the tables detected in step S11, and in step S13, table data is extracted by an image-based method for each of the tables detected in step S11.
- Steps S12 and S13 may be performed in parallel, and step S13 may be performed before step S12.
- In step S13, cells that are not divided by ruled lines (cells separated by blanks) and that are arranged in an area divided by ruled lines may be detected, and the table data arranged in the cells separated by blanks may be extracted.
- Either step S12 or step S13 may be omitted.
- In step S14, it is determined for each cell whether or not there is duplication between the table data extracted in step S12 and the table data extracted in step S13 (consistency check). If there is duplication, the table data extracted by the image-based method in step S13 is adopted, and the adopted table data is stored in the storage 25 as the table data for the table. Therefore, when the calculation process or the annotation addition process using the table data is performed, the table data extracted by the image-based method is used. The table data extracted by the image-based method takes precedence over the table data extracted by the text-based method because, in the image-based method, the boundaries of the table (vertical outer borders and horizontal outer borders) are clearly defined, which makes the table extraction more accurate.
- In step S14, for example, when the position in the digital document of the table detected in step S12 matches the position in the digital document of the table detected in step S13, it is determined that the same table has been extracted in duplicate. For example, if both steps S12 and S13 find a table at the same or a nearby position on page 10 of the digital document and the tables are the same or similar in size, the two tables are determined to be duplicates.
- In step S14, even if the table detected in step S12 and the table detected in step S13 are at the same position, if the numbers of detected cells differ, the table data with the larger number of detected cells may be adopted.
- Table T41 shown in FIG. 6 contains a large number of cells separated by blanks.
- In the image-based method, which detects cells based on ruled lines, some cells may not be recognized correctly even when cells separated by blanks are detected. Therefore, if more cells are detected in step S12, in which table data is detected by the text-based method, than in step S13, in which table data is detected by the image-based method, the table data detected by the text-based method may be adopted.
- In step S15, the table data of each table extracted in step S12 or S13 is stored in the storage 25 in association with the table identification information of each table. If it is determined in step S14 that the table data detected in step S12 and the table data detected in step S13 overlap, only the table data detected in step S13 may be stored in the storage 25, and the table data detected in step S12 may be discarded.
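- The selection between text-based and image-based table data in steps S14 and S15 can be sketched as follows. The record layout (`page`, `bbox`, `cells`), the position tolerance `tol`, and the flag `prefer_more_cells` are assumptions introduced for illustration; the description does not specify how "same or close positions" are measured.

```python
def same_table(a, b, tol=10):
    """Treat two detections as duplicates if they are on the same page and
    their bounding boxes (x0, y0, x1, y1) agree within a tolerance."""
    return a["page"] == b["page"] and all(
        abs(p - q) <= tol for p, q in zip(a["bbox"], b["bbox"])
    )


def choose_table_data(text_based, image_based, prefer_more_cells=False):
    """Return the list of table data records to store for one detection pair."""
    if not same_table(text_based, image_based):
        return [text_based, image_based]  # different tables: keep both
    if prefer_more_cells and len(text_based["cells"]) > len(image_based["cells"]):
        return [text_based]   # variant: adopt the data with more detected cells
    return [image_based]      # default: image-based data takes precedence
```

- The default branch corresponds to adopting the image-based data and discarding the text-based data, while `prefer_more_cells=True` corresponds to the variant in which the data with the larger number of detected cells is adopted.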
- In step S16, it is determined whether or not each of the tables included in the digital document to be processed can be joined with other tables.
- For example, another table having header information that matches the header information of one table included in the digital document is selected as a join candidate.
- More specifically, another table having row header information or column header information that matches at least one of the row header information and the column header information of one of the tables detected in step S11 is selected as a join candidate table.
- In step S16, it is further determined whether or not to perform the join with the selected join candidate. For example, if there is no text, or only predefined text, between a table contained in the digital document and the table selected as its join candidate, it is determined that the table is to be joined with the join candidate.
- the determination process in step S16 may be performed by the determination unit 21c described above.
- Alternatively, in step S16, if there is no text, or only predefined text, between any two of the tables detected in step S11, the two tables may be selected as a join candidate pair. In this case, it is determined whether or not at least one of the row header information and the column header information matches between the two tables selected as the join candidate pair, and if it matches, it is determined that the join candidate pair can be joined.
- If no joinable table is identified in step S16, the table join process ends.
- If a set of tables that can be joined is identified in step S16, the two (or three or more) tables (table portions) determined to be joinable are joined.
- The join table obtained by joining the two tables is assigned a table identification number that identifies the join table, and the table data of the two tables before the join is stored in association with this table identification number.
- the process of joining this table may be performed by the above-mentioned joining portion 21d.
- In step S17, when all the tables that can be joined have been joined, the table join process ends.
- In step S21, numerical data indicating numerical values is extracted from the table data of the tables included in the digital document to be processed.
- In step S22, a predetermined calculation is performed based on the numerical values extracted in step S21.
- For example, the numerical values extracted in step S21 are calculated according to the rules assumed for a balance sheet. As shown in FIG. 4, the numerical values arranged in the cells corresponding to the individual "current assets" items (such as "cash and deposits") contained in the rows between the row header "current assets" and the row header "total current assets" are added, and the value corresponding to "total current assets" is calculated.
- The calculation rules can be set in advance. For example, it is possible to set a rule that, if a row header contains the term "total", the numerical values included in the rows that precede that row header and belong to the same unit in the table are added.
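- A rule of this kind can be sketched as follows. This is one possible reading of the rule, under the assumption that a row without a numerical value (such as the section header "current assets") starts a new unit and that a row whose header contains "total" closes it; the function name and the sample amounts in the usage note are illustrative.

```python
def apply_total_rule(rows):
    """Compute the expected value for every row whose header contains 'total'.

    rows: list of (row_header, value) pairs in table order, where value is
    None for section headers without a number. Returns a dict mapping each
    'total' row header to the sum of the rows of its unit.
    """
    results = {}
    unit_values = []
    for header, value in rows:
        if value is None:
            unit_values = []                 # a section header starts a new unit
        elif "total" in header.lower():
            results[header] = sum(unit_values)
            unit_values = []                 # the unit is closed by its total row
        else:
            unit_values.append(value)
    return results
```

- For rows such as ("current assets", None), ("cash and deposits", 100), ("inventories", 50), ("total current assets", 150), the sketch computes 150 for "total current assets", which can then be compared with the value stated in the table.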
- In step S23, annotation information indicating the result and / or content of the calculation in step S22 is generated, and the generated annotation information is displayed together with the table.
- For example, it is determined whether or not the calculation result of "total current assets" calculated in step S22 matches the numerical value in the cell of the row corresponding to "total current assets" in the table, and the mismatch mark S1 or the match mark S2 is displayed according to the result of the determination.
- In the example of FIG. 4, the cell in the "Total current assets" row in the column of "Previous consolidated fiscal year (March 31, 2018)" displays a numerical value that matches the calculation result in step S22.
- Therefore, the match mark S2 is displayed in or near the cell.
- On the other hand, the cell in the "Total current assets" row in the column of "Current consolidated fiscal year (March 31, 2019)" displays a numerical value that does not match the calculation result in step S22.
- Therefore, the mismatch mark S1 is displayed in or near the cell. Further, the calculation ranges are indicated by arrows A1 to A10.
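- The comparison that selects between the match mark S2 and the mismatch mark S1 can be sketched as follows. The function name and the returned dictionary are assumptions for illustration; the difference is taken as the calculated value minus the stated value, which is consistent with the "1" shown for the "Buildings and structures (net amount)" example (2,002,570 calculated against 2,002,569 stated).

```python
def make_annotation(calculated, stated):
    """Compare a calculated value with the value stated in the table."""
    difference = calculated - stated
    if difference == 0:
        return {"mark": "S2", "difference": None}    # match mark
    # Mismatch mark, with the difference shown as difference information.
    return {"mark": "S1", "difference": difference}
```

- For example, `make_annotation(2002570, 2002569)` yields the mismatch mark S1 with a difference of 1, and `make_annotation(38545156, 38545156)` yields the match mark S2.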
- the annotation addition process may be performed on each of the tables included in the digital document, or may be performed on the joined table generated by joining two or more tables included in the digital document. That is, the annotation processing disclosed herein does not presuppose that the tables contained in the digital document are combined.
- the table join system 1 is described by the name of "table join" system for convenience of explanation, but does not necessarily perform table join.
- the table joining system 1 shown in FIG. 1 when the annotation addition processing is performed without joining the tables included in the digital document, the function of at least one of the determination unit 21c and the joining unit 21d is performed. It is not necessary to have. In this case, the table join system 1 does not have a function of joining the tables included in the digital document, but can perform annotation processing on the table.
- In addition to or in place of the function of joining the tables contained in the digital document, the processor 21 of the table join system 1 may function as a table division unit (not shown) that divides a table contained in the digital document. For example, as shown in FIG. 12, the table division unit determines whether a repetition pattern exists in the column header information of an original table included in the digital document as one table and, if it determines that a repetition pattern exists, may divide the original table into subtables, one for each repetition.
- For example, in the table T51 shown in the figure, C1 to C5 appear three times as the column header information.
- The processor 21 uses as a cutting position the ruled line L11 between the column header cell containing C5, which is the end point of a repetition, and the column header cell containing C1, which is the start point of a repetition.
- The table may thus be divided horizontally into three tables (subtables) T51-1 to T51-3.
- Each of the subtables T51-1 to T51-3 divided in this way contains exactly one unit of the repetition pattern, consisting of the five columns corresponding to the column headers C1 to C5.
- In FIG. 12, an example in which the repetition pattern appears in the column header information has been described, but the repetition pattern may instead appear in the row header information.
- In that case, the original table is divided vertically for each repetition of the pattern.
- Table identification information that identifies each subtable may be added to each of the divided subtables.
- The table data extraction unit 21b may extract table data for each of the divided subtables and store the extracted table data in the storage 25 for each subtable.
- The table data of each subtable may be stored in the storage 25 in association with the table identification information that identifies that subtable.
- The calculation unit 21e can execute a calculation for each divided subtable based on the numerical values included in the table data of that subtable.
- The annotation addition unit 21f can add and display, for each divided subtable, annotation information indicating the result and/or the content of the calculation performed by the calculation unit 21e.
- In the table T61 shown in FIG. 13, C1 to C5 are repeated twice, but the first C3 counting from the left is located at the head (left end) of the plurality of sub-columns included in the column of GC2, while the second C3 counting from the left is located at the right end of the plurality of sub-columns included in the column whose header information is GC3. Therefore, it is determined that the table T61 does not have a repetition pattern at which it can be divided.
- The repetition pattern that serves as the unit for dividing a table can be detected by analyzing the text contained in the column header cells or the row header cells.
- For example, the balance sheet shown in FIG. 14 includes a column header cell containing the text "Subject" and a column header cell containing the text "Amount" in each of the "Assets" and "Liabilities" sections. The table division unit can therefore determine that the "Subject" and "Amount" column header cells are repeated in the balance sheet, and can accordingly divide the balance sheet into a subtable consisting of the columns of the "Assets" section and a subtable consisting of the columns of the "Liabilities" section.
- The processor 21 may have a function of estimating whether each of the cells constituting a table included in the digital document is a header cell containing header information or a table data cell containing table data.
- When the table includes, in this order, a first header column containing only header cells, a first data column containing only table data cells, a second header column containing only header cells, and a second data column containing only table data cells, the table may be divided into two subtables at the boundary between the first data column and the second header column. See FIG. 14 again. In the balance sheet of FIG. 14, the "Assets" section is placed on the left side and the "Liabilities" section on the right side.
- The "Assets" section is divided into a "Subject" column containing only header information and an "Amount" column containing only the content (data) for that header information.
- The "Liabilities" section is likewise divided into a "Subject" column containing only header information and an "Amount" column containing only the content (data) for that header information. Therefore, the balance sheet can be divided into two subtables at the boundary between the "Amount" column of the "Assets" section and the "Subject" column of the "Liabilities" section.
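
The header-column/data-column division described in the last few items can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the names `is_header_cell`, `classify_columns`, and `split_header_data_table` are hypothetical, the table is given as body rows only (the column header row is excluded), and a cell is guessed to be a header cell simply when it contains no digit. The source does not state how the estimation is actually performed.

```python
from typing import List

Table = List[List[str]]  # body rows of cell texts, all rows the same length


def is_header_cell(text: str) -> bool:
    """Hypothetical heuristic: a cell without any digit is treated as a header cell."""
    return not any(ch.isdigit() for ch in text)


def classify_columns(table: Table) -> List[str]:
    """Label each column 'H' (header cells only), 'D' (data cells only) or 'M' (mixed)."""
    labels = []
    for column in zip(*table):
        flags = [is_header_cell(cell) for cell in column]
        if all(flags):
            labels.append("H")
        elif not any(flags):
            labels.append("D")
        else:
            labels.append("M")
    return labels


def split_header_data_table(table: Table) -> List[Table]:
    """Split at the boundary between the first data column and the second header
    column when the columns are exactly header, data, header, data in this order.
    Otherwise return the table unchanged as a single-element list."""
    if classify_columns(table) != ["H", "D", "H", "D"]:
        return [table]
    left = [row[:2] for row in table]
    right = [row[2:] for row in table]
    return [left, right]


# Usage with made-up balance-sheet rows (the values are illustrative only).
body = [
    ["Cash", "100", "Accounts payable", "40"],
    ["Inventory", "50", "Loans", "60"],
]
assets, liabilities = split_header_data_table(body)
```

A real classifier would need to handle empty cells, dates in subject names, and more than two header/data pairs; the sketch only shows where the cut falls.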
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Document Processing Apparatus (AREA)
Abstract
The table combining program according to one or a plurality of embodiments of the present invention causes to be executed a table detection function for detecting a first table portion and a second table portion different from the first table portion from a digital document, an assessment function for assessing whether the first table portion and the second table portion can be combined, and a function for combining the first table portion and the second table portion when it is assessed that the first table portion and the second table portion can be combined.
Description
The disclosure herein relates to a table join program, a table join system, and a table join method for processing table data extracted from a table contained in a digital document.
Digital documents often include tables. In particular, IR materials such as financial statements include many tables. In order to analyze a digital document, it is desirable to efficiently extract table data from the tables contained in the digital document. The table data includes, for example, text data expressed in a table format. Conventional techniques for extracting table data from a table included in a digital document are described in, for example, Patent Document 1 and Non-Patent Document 1 below.
As described in Patent Document 1, when a digital document is created in PDF (Portable Document Format), two methods of extracting table data from a table are known: a method of converting the PDF file into an image file and extracting the table data based on that image file (hereinafter referred to as the "image-based method"), and a method of reading text data and formatting data such as vertical and horizontal ruled lines directly from the PDF file (hereinafter referred to as the "direct method").
In the image-based method, the table layout is recognized and the table data is extracted using the image file converted from the PDF file. Specifically, the table layout is recognized by analyzing image information such as the ruled lines of the table, and the table data (for example, the text contained in the table) is extracted by, for example, OCR (Optical Character Recognition/Reader).
In the direct method, the text is read directly from the PDF file, and the layout representing the positional relationship between the texts is recognized.
Digital documents include various types of tables. For example, some tables have at least one of the rows or columns separated by whitespace. In a table in which rows or columns are separated by such blanks, it is difficult to recognize the table layout by an image-based method.
When a table is too wide to fit within the width of one page, or too long to be displayed within one page, that single table may be displayed divided into a plurality of table portions. With conventional table data extraction techniques, a table displayed divided into two table portions in this way is recognized as two separate tables.
One of the objects of the invention disclosed herein is to solve or alleviate problems in conventional table data extraction techniques. One of the more specific objects of the invention disclosed herein is to provide a table join program capable of joining a plurality of table portions when one table is divided into a plurality of table portions in a digital document.
Other objectives of the invention disclosed herein will become apparent by reference to the entire specification. The invention disclosed herein may solve the problems identified from the description of the present specification in place of or in addition to the above-mentioned problems.
本発明の一又は複数の実施形態による表結合プログラムは、デジタル文書から第1表部分及び前記第1表部分と異なる第2表部分を検出する表検出機能と、前記第1表部分と前記第2表部分とが結合可能か判定する判定機能と、前記第1表部分と前記第2表部分とが結合可能と判定された場合に、前記第1表部分と前記第2表部分とを結合する機能と、を実行させる。
The table join program according to one or more embodiments of the present invention has a table detection function for detecting a first table portion and a second table portion different from the first table portion from a digital document, and the first table portion and the first table portion. The determination function for determining whether the two table portions can be combined, and when it is determined that the first table portion and the second table portion can be combined, the first table portion and the second table portion are combined. To execute the function to be performed.
In one or more embodiments of the present invention, the first table portion is extracted from the first page, and the second table portion is extracted from the second page different from the first page.
The table join program according to one or more embodiments of the present invention further causes one or more processors to execute a function of performing a calculation based on first numerical data included in the first table portion and second numerical data included in the second table portion.
The table join program according to one or more embodiments of the present invention further causes the one or more processors to execute a function of adding, to the digital document, annotation information indicating at least one of the result of the calculation based on the first numerical data and the second numerical data and the content of that calculation.
In one or more embodiments of the present invention, the first table portion and the second table portion are determined to be joinable when first header information of the first table portion and second header information of the second table portion are identical.
In one or more embodiments of the present invention, the first table portion and the second table portion are determined to be joinable when no text exists between the first table portion and the second table portion in the digital document, or when only predetermined predefined text exists between them.
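
The two joinability conditions above (identical header information, and no text or only predefined text between the portions) can be sketched in Python as follows. This is an illustration under stated assumptions, not the claimed implementation: the source presents the conditions as separate embodiments, whereas the sketch requires both; `TablePart`, `can_join`, `join_parts`, and the contents of `PREDEFINED_TEXTS` are hypothetical; and joining is assumed to be a simple vertical concatenation of rows.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass
class TablePart:
    header: List[str]      # column header texts
    rows: List[List[str]]  # body rows


# Hypothetical examples of "predefined text" allowed between two table portions.
PREDEFINED_TEXTS: FrozenSet[str] = frozenset({"(continued)", "continued"})


def can_join(first: TablePart, second: TablePart, text_between: List[str],
             predefined: FrozenSet[str] = PREDEFINED_TEXTS) -> bool:
    """Return True when the headers are identical and every non-empty text line
    found between the two portions is one of the predefined texts."""
    same_header = first.header == second.header
    lines = [t.strip().lower() for t in text_between if t.strip()]
    gap_ok = all(line in predefined for line in lines)
    return same_header and gap_ok


def join_parts(first: TablePart, second: TablePart) -> TablePart:
    """Assumed join: keep one header and append the second portion's rows."""
    return TablePart(header=list(first.header), rows=first.rows + second.rows)


# Usage with made-up table portions from two consecutive pages.
page1 = TablePart(["Item", "Amount"], [["Cash", "100"]])
page2 = TablePart(["Item", "Amount"], [["Inventory", "50"]])
if can_join(page1, page2, ["(Continued)"]):
    joined = join_parts(page1, page2)
```

A table split horizontally (too wide for one page) would instead need a column-wise join keyed on row headers, which this sketch does not cover.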
In one or more embodiments of the invention, the digital document is an unstructured document.
The annotation addition program according to one or more embodiments of the present invention causes one or more processors to execute a function of extracting numerical values from an unstructured document, a function of executing a calculation based on the extracted numerical values, and a function of adding, to the digital document, annotation information indicating at least one of the result of the calculation and the content of the calculation. The annotation addition method according to one or more embodiments of the present invention is performed by one or more computer processors executing computer-readable instructions. The annotation addition method includes a step of extracting numerical values from an unstructured document, a step of executing a calculation based on the extracted numerical values, and a step of adding, to the digital document, annotation information indicating at least one of the result of the calculation and the content of the calculation.
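
As a rough illustration of the extract, calculate, and annotate flow described above, the following Python sketch sums component cells and compares the sum with a stated total, producing a match or mismatch mark. The function names `parse_amount` and `annotate_total` are hypothetical, and the number format (comma thousands separators, parentheses for negative values) is an assumed accounting convention that the source does not specify.

```python
from typing import List, Tuple


def parse_amount(text: str) -> int:
    """Parse cell text such as '1,234' or '(1,234)'.
    Parentheses are assumed to denote a negative value."""
    s = text.strip().replace(",", "")
    if s.startswith("(") and s.endswith(")"):
        return -int(s[1:-1])
    return int(s)


def annotate_total(component_cells: List[str], total_cell: str) -> Tuple[int, str]:
    """Sum the component cells and compare the sum with the stated total.
    Returns the computed sum and the annotation mark 'match' or 'mismatch'."""
    computed = sum(parse_amount(cell) for cell in component_cells)
    mark = "match" if computed == parse_amount(total_cell) else "mismatch"
    return computed, mark


# Usage with made-up figures: 1,200 + 300 - 50 equals the stated total 1,450.
result = annotate_total(["1,200", "300", "(50)"], "1,450")
```

In a full system the returned mark and the range of summed cells would be rendered onto the document as annotation information; how that rendering is done is outside this sketch.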
The table division program according to one or more embodiments of the present invention causes one or more processors to execute a function of analyzing the pattern of the cells constituting a table contained in a digital document, and a function of dividing the table into subtables, one for each repetition, when a repetition pattern appears in any column or row of the table. The table division method according to one or more embodiments of the present invention is performed by one or more computer processors executing computer-readable instructions. The table division method includes a step of analyzing the pattern of the cells constituting a table contained in a digital document, and a step of dividing the table into subtables, one for each repetition, when a repetition pattern appears in any column or row of the table.
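
The division by repetition pattern can be sketched in Python as below. This is a simplified illustration under assumptions: the names `find_repeat_unit` and `split_by_repeat` are hypothetical, the first row of the table is taken to be the column header row, and a repetition is recognized only when the whole header row is an exact whole-number repetition of a shorter unit. Group headers spanning several columns are not considered.

```python
from typing import List, Optional

Table = List[List[str]]  # table[0] is assumed to be the column header row


def find_repeat_unit(headers: List[str]) -> Optional[int]:
    """Return the length of the shortest unit that, repeated two or more times,
    reproduces the header row exactly; return None if there is no such unit."""
    n = len(headers)
    for unit in range(1, n // 2 + 1):
        if n % unit == 0 and headers == headers[:unit] * (n // unit):
            return unit
    return None


def split_by_repeat(table: Table) -> List[Table]:
    """Divide the table into one subtable per repetition of the header pattern.
    The table is returned unchanged (as a single element) when no pattern exists."""
    unit = find_repeat_unit(table[0])
    if unit is None:
        return [table]
    width = len(table[0])
    return [[row[i:i + unit] for row in table] for i in range(0, width, unit)]


# Usage with a made-up table whose header repeats C1, C2 twice.
table = [
    ["C1", "C2", "C1", "C2"],
    ["1", "2", "3", "4"],
]
subtables = split_by_repeat(table)
```

For a header row in which C1 to C5 appear three times, the same function would yield three subtables of five columns each. A row-header repetition could be handled by transposing the table first.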
The table join system according to one or more embodiments of the present invention includes one or more processors. By executing computer-readable instructions, the one or more processors detect, from a digital document, a first table portion and a second table portion different from the first table portion, determine whether the first table portion and the second table portion can be joined, and join the first table portion and the second table portion when it is determined that they can be joined. The table division method according to one or more embodiments of the present invention includes a step of estimating whether each cell constituting a table included in a digital document corresponds to a header element or a table data element, and a step of dividing the table into two subtables at the boundary between a first data column and a second header column when the table includes, in this order, a first header column containing only cells corresponding to header elements, the first data column containing only cells corresponding to table data elements, the second header column containing only cells corresponding to header elements, and a second data column containing only cells corresponding to table data elements.
The table join method according to one or more embodiments of the present invention is performed by one or more computer processors executing computer-readable instructions. The table join method includes a step of detecting, from a digital document, a first table portion and a second table portion different from the first table portion, a step of determining whether the first table portion and the second table portion can be joined, and a step of joining the first table portion and the second table portion when it is determined that they can be joined.
The user apparatus according to one or more embodiments of the present invention includes one or more processors. By executing computer-readable instructions, the one or more processors upload a digital document to a server, acquire from the server an annotated document to which annotation information has been added indicating at least one of the result of a calculation based on numerical values extracted from the digital document and the content of that calculation, and display the annotated document.
The table division program according to one or more embodiments of the present invention causes one or more processors to execute a function of estimating whether each cell constituting a table included in a digital document corresponds to a header element or a table data element, and a function of dividing the table into two subtables at the boundary between a first data column and a second header column when the table includes, in this order, a first header column containing only cells corresponding to header elements, the first data column containing only cells corresponding to table data elements, the second header column containing only cells corresponding to header elements, and a second data column containing only cells corresponding to table data elements. The method of creating an electronically annotated unstructured document according to one or more embodiments of the present invention includes a step of acquiring an unstructured document; a step of detecting, from the unstructured document, a first table portion and a second table portion different from the first table portion; a step of determining whether the first table portion and the second table portion can be joined; a step of joining the first table portion and the second table portion when it is determined that they can be joined; a step of executing a calculation based on first numerical data included in the first table portion and second numerical data included in the second table portion; and a step of generating the annotated unstructured document by adding, to the unstructured document, annotation information indicating at least one of the result of the calculation based on the first numerical data and the second numerical data and the content of that calculation.
According to an embodiment of the present invention, it is possible to provide a table joining program capable of joining a plurality of table parts when one table is divided into a plurality of table parts in a digital document.
Hereinafter, various embodiments of the present invention will be described with reference to the drawings as appropriate. As shown in FIG. 1, the table join system 1 according to an embodiment of the present invention includes a user device 10 and a server 20. The table join system 1 may include a storage 30. The user device 10, the server 20, and the storage 30 are communicably connected to each other via a network 40. The network 40 may be a single network, or may be configured by connecting a plurality of networks. The network 40 is, for example, the Internet, a mobile communication network, or a combination thereof. Any network that enables communication between electronic devices can be used as the network 40.
The table join system 1 shown in FIG. 1 is an example of a system to which the present invention can be applied, and the systems to which the present invention can be applied are not limited to the one shown in FIG. 1. A table join system 1 to which the present invention is applicable need not include all of the illustrated components. For example, the table join system 1 need not include the storage 30. The table join system 1 may also include components that are not illustrated. For example, although only one user device 10 is shown in FIG. 1 for simplicity of description, the table join system 1 can include any number of user devices 10, two or more.
The illustrated table join system 1 includes the user device 10 and the server 20 connected to the network 40, but either one of the user device 10 and the server 20, each of which is a subcombination of the table join system 1, can also be understood as a table join system to which the invention described in the claims applies. The table join system described in the claims may have the configurations and functions of both the user device 10 and the server 20 as constituent requirements, or may have the configuration and functions of only one of the user device 10 and the server 20 as constituent requirements and not those of the other. In other words, when the table join system described in a claim includes the configurations and functions of both the user device 10 and the server 20 as constituent requirements, the table join system 1 including the user device 10 and the server 20 corresponds to an embodiment of the invention relating to the table join system described in that claim. On the other hand, when the table join system described in a claim includes only the configuration and functions of the user device 10 as constituent requirements, the illustrated user device 10 corresponds to an embodiment of the invention relating to the table join system described in that claim. Likewise, when the table join system described in a claim includes only the configuration and functions of the server 20 as constituent requirements, the illustrated server 20 corresponds to an embodiment of the invention relating to the table join system described in that claim.
The table join system 1 identifies a table included in a digital document, or a table portion that is a part of such a table. In a digital document, when a table is too wide to fit within the width of one page, or too long to be displayed within one page, that single table may be displayed divided into a plurality of table portions. The table join system 1 can join the plurality of table portions into which one table has been divided. A digital document suitable for handling by the table join system 1 is, for example, an unstructured document, that is, a document that has no structure definition. The unstructured document is, for example, a document in PDF format.
First, the configuration of the user device 10 will be described. The user device 10 is a personal computer (PC), a tablet terminal, a smartphone, or various information processing devices other than these. The user apparatus 10 includes a processor 11, a memory 12, a user interface 13, a communication interface 14, and a storage 15.
The processor 11 is an arithmetic unit that loads an operating system and various other programs from the storage 15 or other storage into the memory 12 and executes instructions included in the loaded program. The processor 11 is, for example, a CPU, an MPU, a DSP, a GPU, various arithmetic units other than these, or a combination thereof. The processor 11 may be realized by an integrated circuit such as an ASIC, PLD, FPGA, MCU or the like.
The memory 12 is used to store instructions executed by the processor 11 and various other data. The memory 12 is a main storage device (main memory) that the processor 11 can access at high speed. The memory 12 is composed of, for example, a RAM such as a DRAM or an SRAM.
The user interface 13 includes an input interface that accepts user input and an output interface that outputs various information under the control of the processor 11. The input interface is, for example, a keyboard, a pointing device such as a mouse, a touch panel, or any other information input device capable of receiving user input. The output interface is, for example, a liquid crystal display, a display panel, or any other information output device capable of outputting the calculation results of the processor 11.
The communication interface 14 is implemented as hardware, firmware, communication software such as a TCP / IP driver or PPP driver, or a combination thereof. The user device 10 can send and receive data to and from other devices such as the server 20 via the communication interface 14.
The storage 15 is an external storage device accessed by the processor 11. The storage 15 is, for example, a magnetic disk, an optical disk, a semiconductor memory, or any other storage device capable of storing data. The storage 15 may store a digital document 15a. The digital document 15a may be, for example, an unstructured document such as a PDF file.
Next, the function of the user device 10 will be described. The processor 11 of the user device 10 functions as an upload unit 11a and a display unit 11b. The upload unit 11a uploads the digital document to the server 20. For example, the upload unit 11a can read the digital document 15a from the storage 15 and upload the read digital document 15a to the server 20.
The display unit 11b displays a digital document on the display. The display unit 11b may display, for example, the digital document 15a read from the storage 15 on the display. The display unit 11b may receive a digital document from the server 20 and display the received digital document on the display. The display unit 11b may also receive the annotated document 25c, described later, from the server 20 and display it on the display.
Next, the configuration of the server 20 will be described. The server 20 includes a processor 21, a memory 22, a user interface 23, a communication interface 24, and a storage 25.
The processor 21 is an arithmetic unit that loads various programs for providing an operating system and an application into the memory 22 and executes instructions included in the loaded programs. The description of the processor 11 also applies to the processor 21, and the description of the memory 12 also applies to the memory 22.
The user interface 23 includes an input interface that accepts input from an operator of the server 20 and an output interface that outputs various information under the control of the processor 21.
The communication interface 24 is implemented as hardware, firmware, communication software such as a TCP/IP driver or a PPP driver, or a combination thereof. The server 20 can send data to and receive data from other devices via the communication interface 24.
The storage 25 is an external storage device accessed by the processor 21. The storage 25 is, for example, a magnetic disk, an optical disk, a semiconductor memory, or any other storage device capable of storing data.
The storage 25 may store a table join program 25a for extracting table data from a table or table portion included in a digital document and joining table portions based on the extracted table data, an original document 25b, which is a digital document analyzed by executing the table join program, and an annotated document 25c, which includes annotations added by executing the table join program. The instructions included in the table join program 25a may be executed by the processor 21. Details of the functions realized by executing the table join program 25a will be described later.
The original document 25b is, for example, a digital document uploaded from the user device 10. The original document 25b can include one or more tables. As described above, the original document 25b may be an unstructured document, such as a PDF file, that has no structure definition. An unstructured document includes objects such as text and images contained in each page of the document, together with coordinate information indicating where each object is placed on the page, but it does not include information indicating the structure of the document.
As will be described later, the annotated document 25c is a document in which annotation information indicating the content and results of calculations on the numerical data included in the original document 25b has been added to the original document 25b. The details of the annotation information will be described later. When the original document 25b is an unstructured document, the annotation information may be stored in the storage 25 as an object of the original document 25b.
In the table join system 1, there are no particular restrictions on where data is stored. For example, the various data that can be stored in the storage 15 may instead be stored in a storage that is physically separate from the user device 10 (for example, the storage 25 or the storage 30) or in a database server. Similarly, the various data that can be stored in the storage 25 may be stored in a storage that is physically separate from the server 20 (for example, the storage 15 or the storage 30) or in a database server. In FIG. 1, the storage 15 and the storage 25 are each shown as a single unit, but at least one of the storages 15 and 25 may be a collection of a plurality of physically separate storages. That is, in the present specification, the data stored in the storage 15 and the data stored in the storage 25 may each be stored in a single storage or distributed across a plurality of storages. In the present specification and claims, the term "storage" by itself may refer to either a single storage or a collection of a plurality of storages, as far as the context permits.
Next, the functions of the server 20 will be described. By executing the instructions included in the table join program 25a or other instructions, the processor 21 of the server 20 functions as a table detection unit 21a, a table data extraction unit 21b, a determination unit 21c, a joining unit 21d, a calculation unit 21e, and an annotation addition unit 21f.
The table detection unit 21a detects tables included in the digital document to be analyzed. For example, the table detection unit 21a can perform image processing such as rectangle detection on the digital document and detect the rectangular elements included in the digital document as tables. The table detection unit 21a can also detect tables in a digital document by any known method other than rectangle detection. The digital document to be analyzed is, for example, the original document 25b stored in the storage 25. The original document 25b may include a table that has been divided into a plurality of table portions. For example, in the original document 25b, a table that is wider than the page may be divided into a plurality of table portions. Likewise, when a table is too long vertically to fit on one page, that single table may be divided into a plurality of table portions placed across a plurality of pages. When one table is divided into a plurality of table portions, the table detection unit 21a detects each of the table portions as one table. In other words, the table detection unit 21a detects the tables contained in the digital document without distinguishing whether a detected table is a whole table or part of a table. For example, when one table in a digital document is divided into a first table portion and a second table portion, each of the two portions is detected as a table. When the digital document contains a plurality of pages, the table detection unit 21a performs the table detection process on each of the pages. The term "table" is used herein in its general sense. That is, a "table" in the present specification means data such as characters and numbers arranged in cells separated by ruled lines. In a table, some of the ruled lines separating the cells may be omitted, and cells may instead be separated by whitespace. A cell that is delimited by whitespace on any of its top, bottom, left, or right sides may be called a whitespace-delimited cell.
For each table detected by the table detection unit 21a, the table data extraction unit 21b recognizes the layout of the table and extracts the table data arranged in the cells constituting the table. The table data is the characters, numbers, or other data arranged in each cell. The table data extraction unit 21b may store the extracted table data in the storage 25 table by table. The table data of each table may be stored in the storage 25 in association with table identification information that identifies the table.
In one embodiment, the table data extraction unit 21b can extract the table data of each table detected by the table detection unit 21a by an image-based method, a text-based method, or another known method. The text-based method converts a PDF file into a text file and extracts the table data based on that text file. When extracting table data by the image-based method, the table data extraction unit 21b converts the original document 25b into an image file and uses this image file to recognize the table layout and extract the table data. For example, the table data extraction unit 21b performs rectangle detection on the image file converted from the digital document, detects the rectangular graphic elements included in a table, and recognizes each detected rectangular graphic element as a cell. The table data extraction unit 21b also detects the coordinate information, width, and height of each detected cell and, based on this coordinate information, recognizes which cells belong to the same column and the same row. The cells recognized in this way are surrounded by ruled lines. The table data extraction unit 21b then detects the table data arranged in each cell by, for example, OCR (Optical Character Recognition/Reader).
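The step of recognizing cells that belong to the same row and the same column from their coordinates can be illustrated with a minimal Python sketch. The `Cell` class, the `group_cells` function, and the tolerance `tol` are hypothetical names introduced here for illustration only; the specification does not prescribe a particular grouping algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """A detected rectangular cell (hypothetical representation)."""
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float
    text: str = ""


def group_cells(cells, tol=2.0):
    """Group cells into rows (similar y) and columns (similar x).

    `tol` is an assumed tolerance absorbing small detection jitter.
    """
    def bucket(key):
        groups = []
        for cell in sorted(cells, key=key):
            # Start a new group when the coordinate jumps by more than tol.
            if groups and abs(key(groups[-1][-1]) - key(cell)) <= tol:
                groups[-1].append(cell)
            else:
                groups.append([cell])
        return groups

    rows = [sorted(g, key=lambda c: c.x) for g in bucket(lambda c: c.y)]
    cols = [sorted(g, key=lambda c: c.y) for g in bucket(lambda c: c.x)]
    return rows, cols
```

Grouping on the top-left corner alone is a simplification; a fuller implementation would probably also use the detected width and height to handle merged cells.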
The table data extraction unit 21b can determine whether a cell delimited by ruled lines contains whitespace-delimited cells. For example, when a cell delimited by ruled lines contains text divided into a plurality of rows or columns, it can be determined that there are whitespace separators between the rows and/or between the columns. In this case, the table data extraction unit 21b can recognize a plurality of whitespace-delimited cells within the cell delimited by ruled lines and detect the table data contained in those whitespace-delimited cells.
When extracting table data by the text-based method, the table data extraction unit 21b converts the document data into a text file, generates a text file in which each piece of text is annotated with its coordinates on the page, and recognizes the table layout by analyzing the positional relationships between the pieces of text based on this coordinate information.
The table data extraction unit 21b can perform both image-based and text-based extraction of table data. The image-based extraction and the text-based extraction may be executed in parallel, or they may be executed sequentially, one before the other.
The determination unit 21c determines whether each of the tables detected by the table detection unit 21a can be joined with another table. When a table that was originally intended to be a single table has been divided into a plurality of table portions for reasons such as page layout, those table portions are regarded as joinable with each other. The determination unit 21c can determine whether one table detected by the table detection unit 21a can be joined with another table based on the table data extracted by the table data extraction unit 21b. Specifically, the determination unit 21c can determine, for example, that two tables detected by the table detection unit 21a are joinable when the header information included in their table data is identical. However, even if the header information of two tables is identical, the two tables may have been created with the intention of being separate tables. Therefore, the determination unit 21c may determine whether text data other than table data is present between two tables whose header information has been determined to be identical, and determine that the two tables are joinable when no text is present between them. When there is no text between two tables, each of the two tables is presumed to be a table portion obtained by dividing one table. The determination unit 21c can therefore determine that two tables are joinable when there is no text between them and their header information is identical. Two tables may also be placed across two different pages. In this case, information that can be ignored as document content may be placed between the two tables, such as a page number, text outside the body area of the page (for example, a header, a footer, a footnote, or a document reference symbol number), a table legend (for example, the notation "(Millions of yen)" in an accounting document), or a table caption. Even if information unrelated to the document content is placed between two tables, when the two tables appear on consecutive pages (for example, one on page 10 and the other on page 11), they were most likely created with the intention of being one table. Therefore, the determination unit 21c may determine that two tables are joinable when they are placed on consecutive pages and their header information is identical. Alternatively, ignorable text may be defined in advance as predefined text, and two tables may be determined to be joinable when only predefined text is placed between them and their header information is identical.
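The determination criteria described above can be summarized in a short Python sketch. The `Table` class, the `can_join` function, and its parameters are hypothetical names chosen for illustration; the sketch combines the alternative embodiments (no intervening text, consecutive pages, only predefined text) behind optional arguments and is not a definitive implementation.

```python
from dataclasses import dataclass


@dataclass
class Table:
    """Minimal stand-in for a detected table (hypothetical)."""
    page: int
    header: tuple  # header information extracted as table data


def can_join(a, b, text_between, is_ignorable=None,
             consecutive_pages_suffice=False):
    """Return True if tables a and b are judged joinable.

    text_between: list of text strings found between the two tables.
    is_ignorable: optional predicate marking predefined (ignorable) text.
    consecutive_pages_suffice: if True, identical headers on adjacent
        pages are enough even when other text lies in between.
    """
    if a.header != b.header:
        return False            # identical header information is required
    if not text_between:
        return True             # nothing between the tables
    if consecutive_pages_suffice and abs(a.page - b.page) == 1:
        return True             # adjacent pages with identical headers
    if is_ignorable is not None:
        # Joinable only if every intervening text is predefined text.
        return all(is_ignorable(t) for t in text_between)
    return False
```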
FIGS. 2 and 3 show examples of joinable tables. FIG. 2 shows a table T11 and a table T12. The tables T11 and T12 are placed, for example, on the same page of a digital document stored as the original document 25b. In the illustrated example, the tables T11 and T12 are both table portions that form part of a single statement of changes in shareholders' equity. The row header information T11a of the table T11 consists of, from top to bottom, the texts "Balance at beginning of period", "Changes during period", ..., "Balance at end of period", and the row header information T12a of the table T12 likewise consists of, from top to bottom, the texts "Balance at beginning of period", "Changes during period", ..., "Balance at end of period". The row header information T11a of the table T11 therefore matches the row header information T12a of the table T12. In addition, no text is placed between the table T11 and the table T12. The determination unit 21c can therefore determine that the tables T11 and T12 are joinable. For example, when one table cannot be displayed within the width of one page because of page-width constraints, the table may be divided into two tables and included in the digital document in that form, as with the tables T11 and T12. In this way, based on the row header information of each of two tables obtained by dividing one table horizontally, the determination unit 21c can determine that the two tables divided in the digital document are joinable (horizontally joinable).
FIG. 3 shows a table T21 and a table T22. The tables T21 and T22 each show part of a consolidated statement of cash flows. In FIG. 3, the table T21 is assumed to be placed on page 48 of the digital document and the table T22 on page 49. The column header information T21a of the table T21 consists of, from left to right, the texts "Previous consolidated fiscal year (from April 1, 2017 to March 31, 2018)" and "Current consolidated fiscal year (from April 1, 2018 to March 31, 2019)", and the column header information T22a of the table T22 likewise consists of, from left to right, "Previous consolidated fiscal year (from April 1, 2017 to March 31, 2018)" and "Current consolidated fiscal year (from April 1, 2018 to March 31, 2019)". The column header information T21a of the table T21 therefore matches the column header information T22a of the table T22. Between the table T21 and the table T22 are placed the text "-48-", which indicates the page number, and the text "Unit: thousands of yen", which is the legend of the consolidated statement of cash flows. By registering page numbers and the frequently occurring table note "Unit: thousands of yen" as predefined text, the determination unit 21c can determine that the tables T21 and T22 are joinable. That is, the determination unit 21c has a list of text patterns used to classify text as predefined text. For example, the text "-48-" has the text pattern of a number enclosed between "-" characters, so it can be classified into the category "page number". Each category of predefined text may be assigned an importance. For example, the importance of the categories "page number", "document reference symbol", and "table legend" may be set to "non-important". When determining whether two tables are joinable, the determination unit 21c determines whether all of the text between the two tables belongs to a "non-important" category and, if so, can determine that the tables are joinable. In another embodiment, even if no predefined text is defined, the determination unit 21c can determine that the tables T21 and T22 are joinable because the two tables, which have common header information, are placed on consecutive pages (pages 48 and 49). For example, when one table cannot be displayed on one page because of page-height constraints, the table may be divided into two tables and included in the digital document in that form, as with the tables T21 and T22. In this way, based on the column header information of each of two tables obtained by dividing one table vertically, the determination unit 21c can determine that the two tables divided in the digital document are joinable (vertically joinable).
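The classification of intervening text by text pattern, category, and importance can be sketched as follows. The regular expressions, the `classify` and `only_non_important` functions, and the category tables are illustrative assumptions; the specification states only that a list of text patterns is used and does not give the patterns themselves. No pattern is shown for the "document reference symbol" category because its format is not described.

```python
import re

# Assumed patterns: a number enclosed in "-" marks is a page number;
# text beginning with "Unit:" (or the Japanese "単位:") is a table legend.
PATTERNS = [
    ("page number", re.compile(r"^-\s*\d+\s*-$")),
    ("table legend",
     re.compile(r"^\(?\s*(?:unit|単位)\s*[:：]\s*\S.*$", re.IGNORECASE)),
]

# Importance assigned to each category of predefined text.
IMPORTANCE = {
    "page number": "non-important",
    "document reference symbol": "non-important",
    "table legend": "non-important",
}


def classify(text):
    """Return the predefined-text category of `text`, or None."""
    stripped = text.strip()
    for category, pattern in PATTERNS:
        if pattern.match(stripped):
            return category
    return None


def only_non_important(texts):
    """True if every text between two tables is non-important."""
    return all(IMPORTANCE.get(classify(t)) == "non-important" for t in texts)
```

Under these assumptions, the texts between the tables T21 and T22 ("-48-" and the unit legend) would all be classified as non-important, so the two tables would be treated as joinable provided their header information matches.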
The joining unit 21d joins two tables that the determination unit 21c has determined to be joinable. For example, when the determination unit 21c determines that the tables T11 and T12 shown in FIG. 2 are joinable, the joining unit 21d joins the table T11 and the table T12. Joining the tables T11 and T12 produces a table that is distinct from the tables detected by the table detection unit 21a. In the present specification, a table obtained by joining two tables included in the digital document stored as the original document 25b may be referred to as a "joined table". The joining unit 21d can store the data contained in the two pre-join tables in the storage 25 in association with table identification information that identifies the joined table obtained by joining them.
When two tables divided in the horizontal direction (for example, the tables T11 and T12 in FIG. 2) are determined to be joinable, the joining unit 21d can horizontally join them to create a joined table. In a horizontally joined table, the row data of the pre-join tables is merged. For example, when the tables T11 and T12 shown in FIG. 2 are horizontally joined, both the data contained in the row of the pre-join table T11 whose row header is "Balance at beginning of period" and the data contained in the row of the pre-join table T12 whose row header is "Balance at beginning of period" become the data of the row whose row header is "Balance at beginning of period" in the horizontally joined table.
When two tables divided in the vertical direction (for example, the tables T21 and T22 in FIG. 3) are determined to be joinable, the joining unit 21d can vertically join them to create a joined table. In a vertically joined table, the column data of the pre-join tables is merged. For example, when the tables T21 and T22 shown in FIG. 3 are vertically joined, both the data contained in the column of the pre-join table T21 whose column header is "Previous consolidated fiscal year (from April 1, 2017 to March 31, 2018)" and the data contained in the column of the pre-join table T22 whose column header is "Previous consolidated fiscal year (from April 1, 2017 to March 31, 2018)" become the data of the column whose column header is "Previous consolidated fiscal year (from April 1, 2017 to March 31, 2018)" in the vertically joined table.
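The difference between horizontal and vertical joining can be illustrated with a minimal Python sketch. The dictionary layout (`row_headers`, `col_headers`, `rows`) and the function names are assumptions made for illustration; the specification does not define how table data is represented in the storage 25.

```python
def join_horizontal(left, right):
    """Join two tables that share row headers, side by side."""
    if left["row_headers"] != right["row_headers"]:
        raise ValueError("row headers differ; not horizontally joinable")
    return {
        "row_headers": list(left["row_headers"]),
        "col_headers": left["col_headers"] + right["col_headers"],
        # Each joined row holds the cells of the matching rows of both tables.
        "rows": [l + r for l, r in zip(left["rows"], right["rows"])],
    }


def join_vertical(top, bottom):
    """Join two tables that share column headers, one above the other."""
    if top["col_headers"] != bottom["col_headers"]:
        raise ValueError("column headers differ; not vertically joinable")
    return {
        "row_headers": top["row_headers"] + bottom["row_headers"],
        "col_headers": list(top["col_headers"]),
        # Rows of the lower table are appended, so each column of the
        # joined table holds the column data of both tables.
        "rows": [list(r) for r in top["rows"] + bottom["rows"]],
    }
```

In this sketch a horizontal join lengthens every row, while a vertical join lengthens every column, which mirrors the merging of row data and column data described above.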
The joining unit 21d may join three or more tables. The criteria for determining whether to join three or more tables may be the same as the criteria for determining whether to join two tables. For example, when the row header information of three tables matches and either no text or only predefined text is present between the three tables, the three tables can be joined into one joined table. Three or more tables may be joined in stages. For example, when joining a first table, a second table, and a third table, the first and second tables may first be joined to create an intermediate joined table, and this intermediate joined table may then be joined with the third table to generate the final joined table.
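The staged joining of three or more tables amounts to a left fold over the list of tables. The following Python sketch uses `functools.reduce` for this; the `join_pair` function and the simplified table layout (`header`, `rows`) are hypothetical and stand in for whichever pairwise join is actually used.

```python
from functools import reduce


def join_pair(a, b):
    """Join two vertically split tables with identical headers."""
    if a["header"] != b["header"]:
        raise ValueError("headers differ; tables are not joinable")
    return {"header": a["header"], "rows": a["rows"] + b["rows"]}


def join_all(tables):
    """Join tables in stages: ((t1 + t2) + t3) + ...

    Each intermediate result plays the role of the intermediate
    joined table described in the text.
    """
    return reduce(join_pair, tables)
```

Note that `reduce` raises `TypeError` on an empty list, so a caller would need to handle the case where no tables were detected.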
The calculation unit 21e can perform calculations based on the numerical values included in the table data extracted by the table data extraction unit 21b and/or the numerical values included in the table data of a joined table, which is stored in the storage 25 in association with the table identification information of the joined table obtained by the joining unit 21d joining two tables. For example, when the digital document contains a balance sheet, the amounts of the assets listed in the assets section of the balance sheet are included in the table data of the balance sheet. The calculation unit 21e can calculate total assets, for example, by adding up all the asset amounts included in the table data of the balance sheet. The calculation unit 21e can perform calculations on various subsets of the data stored as table data. For example, total current assets can be calculated by summing the numerical values of the items under current assets in the table data of the balance sheet.
As described above, the calculation unit 21e can also perform calculations based on the table data of a joined table. For example, when a balance sheet is placed across two pages in a digital document and the table portions constituting the balance sheet are not joined, calculations on the numerical values contained in the balance sheet cannot be performed correctly based on the table data. Suppose, for example, that a balance sheet is placed across two pages, that the table portion on the preceding page contains all of the current assets and part of the fixed assets, and that the table portion on the following page contains the rest of the fixed assets. Unless the two portions of the balance sheet are joined, total assets, which is the sum of current assets and fixed assets, cannot be calculated. According to the embodiment of the present invention, the joining unit 21d joins two tables that were originally one table (such as the two table portions of a divided balance sheet), so that even when a digital document contains a divided table, calculations can be performed correctly based on the table data of the joined table. Even when a digital document contains a balance sheet placed across two pages, the embodiment of the present invention makes it possible to correctly calculate the asset amounts, liability amounts, and other values that are to be derived by calculation from the numerical values contained in the balance sheet.
The annotation addition unit 21f can add, to the digital document, annotation information indicating the results and/or content of the calculations performed by the calculation unit 21e based on the table data of a table. When a digital document with annotation information is displayed, objects corresponding to the annotation information are added in or near the tables included in the digital document, in addition to the various objects contained in the original digital document. In the present specification, an object corresponding to annotation information (or an object representing annotation information) that is displayed together with a table of the digital document may be referred to simply as annotation information. FIG. 4 shows part of a consolidated balance sheet T31 contained in a digital document. Annotation information has been added to the consolidated balance sheet T31 in FIG. 4. The annotation information includes arrows A1 to A10 indicating calculation ranges, a mismatch mark S1 indicating that a result calculated by the calculation unit 21e does not match the corresponding data in the table data, and a match mark S2 indicating that a result calculated by the calculation unit 21e matches the corresponding data in the table data.
Each of the arrows A1 to A10 extends from a start point to an end point. The numerical values in the cells between the start point and the end point of an arrow are the subject of the calculation, and the cell that the end point reaches is the cell in which the result of that calculation is shown. In the example of FIG. 4, arrow A1 lies in the column whose column header cell contains the text "Previous consolidated fiscal year (March 31, 2018)" and extends from the first cell of the current assets section (the "Cash and deposits" cell) to its last cell (the "Total current assets" cell). Arrow A1 thus indicates that the values in the cells of the "Cash and deposits", "Notes and accounts receivable", "Inventories", "Accounts receivable-other", "Other", and "Allowance for doubtful accounts" rows of that column are added together and that the sum is shown in the cell of the "Total current assets" row at the end point of arrow A1. Because the value in the "Allowance for doubtful accounts" row is marked with "△", it is converted to a negative number before being added (that is, it is subtracted). The result calculated by the calculation unit 21e over the range indicated by arrow A1 is "38,545,156", which equals the value shown in the "Total current assets" cell of the consolidated balance sheet T31. A match mark S2 is therefore attached to the "Total current assets" cell in the "Previous consolidated fiscal year (March 31, 2018)" column to indicate that the result calculated by the calculation unit 21e matches the value in the table contained in the analyzed original document 25b.
Arrows A2 to A10 indicate calculation ranges in the same way as arrow A1. For example, arrow A2 lies in the column whose column header cell contains the text "Current consolidated fiscal year (March 31, 2019)" and extends from the first cell of the current assets section (the "Cash and deposits" cell) to its last cell (the "Total current assets" cell). Like arrow A1, it indicates that the values in the cells of the "Cash and deposits", "Notes and accounts receivable", "Inventories", "Accounts receivable-other", "Other", and "Allowance for doubtful accounts" rows between its start point and end point are added together and that the sum is shown in the cell of the "Total current assets" row at the end point of arrow A2. The result calculated by the calculation unit 21e over the range indicated by arrow A2 is "46,398,832", which is smaller by "1" than the "46,398,833" shown in the "Total current assets" cell of the consolidated balance sheet T31. A mismatch mark S1 is therefore attached to the "Total current assets" cell in the "Current consolidated fiscal year (March 31, 2019)" column to indicate that the result calculated by the calculation unit 21e does not match the value in the table contained in the analyzed original document 25b. In addition, to show that the correct result is smaller by "1" than the printed value, the difference information "-1" is displayed in small type together with the mismatch mark S1. In the case of a mismatch, the result calculated by the calculation unit 21e ("46,398,832" in the above example) may be displayed in addition to or instead of the difference information.
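The check described for arrows A1 and A2 can be sketched as follows. This is a minimal illustration, not the implementation of the calculation unit 21e: the names `parse_amount`, `CheckResult`, and `check_total` are hypothetical, the sign convention for the difference (calculated value minus printed value) is inferred from the "-1" example above, and the sample figures are invented rather than taken from FIG. 4.

```python
from dataclasses import dataclass


def parse_amount(text: str) -> int:
    """Parse a cell such as '1,234' or '△56'; a leading △ marks a negative value."""
    cleaned = text.strip().replace(",", "")
    negative = cleaned.startswith("△")
    if negative:
        cleaned = cleaned[1:]
    value = int(cleaned)
    return -value if negative else value


@dataclass
class CheckResult:
    computed: int  # sum calculated from the item cells
    stated: int    # value printed in the total cell

    @property
    def mark(self) -> str:
        # "S2" stands for the match mark, "S1" for the mismatch mark.
        return "S2" if self.computed == self.stated else "S1"

    @property
    def difference(self) -> int:
        # Calculated value minus printed value (assumed convention).
        return self.computed - self.stated


def check_total(item_cells: list[str], total_cell: str) -> CheckResult:
    computed = sum(parse_amount(cell) for cell in item_cells)
    return CheckResult(computed, parse_amount(total_cell))


# Hypothetical figures for illustration only.
result = check_total(["1,000", "2,500", "△300"], "3,201")
print(result.mark, result.difference)  # S1 -1
```

Keeping the calculated value and the printed value side by side makes it straightforward to display either the difference or the calculated value itself next to the mismatch mark.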
In a balance sheet, the fixed assets section includes items that show a subtotal of some of the items belonging to fixed assets. The arrows of the annotation information can also indicate the calculation range and calculation result of such a subtotal. For example, in the consolidated balance sheet T31, "Buildings and structures (net)" is a subtotal obtained by adding the "Buildings and structures" value and the "Accumulated depreciation" value among the fixed asset items. Arrow A3 indicates the calculation range for this "Buildings and structures (net)". The result calculated by the calculation unit 21e over the range indicated by arrow A3 is "2,002,570", which is larger by "1" than the "2,002,569" shown in the "Buildings and structures (net)" cell of the consolidated balance sheet T31. A mismatch mark S1 is therefore attached to the "Buildings and structures (net)" cell in the "Previous consolidated fiscal year (March 31, 2018)" column to indicate that the result calculated by the calculation unit 21e does not match the value in the table contained in the analyzed original document 25b. In addition, to show that the correct result is larger by "1" than the printed value, the difference information "1" is displayed in small type together with the mismatch mark S1.
Arrow A10 indicates the calculation range for "Total tangible fixed assets" among the fixed assets. In the consolidated balance sheet T31 illustrated in FIG. 4, "Total tangible fixed assets" is the sum of "Buildings and structures (net)", "Machinery, equipment and vehicles (net)", "Other (net)", "Land", and "Construction in progress". Because "Buildings and structures (net)", "Machinery, equipment and vehicles (net)", and "Other (net)" are each a subtotal of some of the fixed asset items, the span between the start point and the end point of arrow A10 contains both the items to be calculated and the subtotal items that hold the results of calculating them. If every value between the start point and the end point of arrow A10 were included, fixed asset items would be counted twice. For example, "Buildings and structures (net)" is a subtotal obtained by adding the "Buildings and structures" value and the "Accumulated depreciation" value, so including "Buildings and structures", "Accumulated depreciation", and "Buildings and structures (net)" in the calculation of "Total tangible fixed assets" would count "Buildings and structures" and "Accumulated depreciation" twice. Therefore, when "Total tangible fixed assets" is calculated, "Buildings and structures" and "Accumulated depreciation" are excluded from the items covered by arrow A10. Arrow A10 shows the sections corresponding to items that are included in the calculation with a solid line A10a and the sections corresponding to items that are excluded from the calculation with a broken line A10b. As a result, the user of the table joining system 1 can easily identify the items that are included in the calculation from the annotation information.
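The exclusion of items already covered by a subtotal can be sketched as follows. This is only an illustration under stated assumptions: the `Row` type, its `covers` field (how many immediately preceding rows a subtotal row sums), and the function names are hypothetical, and the amounts are invented. How the system actually determines which rows a subtotal covers is not specified here.

```python
from dataclasses import dataclass


@dataclass
class Row:
    label: str
    value: int
    covers: int = 0  # assumed: number of immediately preceding rows this subtotal sums


def segment_styles(rows: list[Row]) -> list[str]:
    """Return 'dashed' for rows excluded from the total and 'solid' for rows included."""
    excluded = set()
    for i, row in enumerate(rows):
        # Rows already summed into a subtotal are excluded to avoid double counting.
        excluded.update(range(max(0, i - row.covers), i))
    return ["dashed" if i in excluded else "solid" for i in range(len(rows))]


def total(rows: list[Row]) -> int:
    styles = segment_styles(rows)
    return sum(row.value for row, style in zip(rows, styles) if style == "solid")


# Hypothetical amounts for illustration only.
rows = [
    Row("Buildings and structures", 500),
    Row("Accumulated depreciation", -200),
    Row("Buildings and structures (net)", 300, covers=2),
    Row("Land", 1000),
    Row("Construction in progress", 50),
]
print(segment_styles(rows))  # ['dashed', 'dashed', 'solid', 'solid', 'solid']
print(total(rows))           # 1350
```

The same list of styles can drive the drawing of the arrow, with solid sections corresponding to A10a and dashed sections to A10b.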
As described above, the calculation unit 21e can perform calculations based on the table data of a joined table obtained by joining two tables. The annotation addition unit 21f can add, to the joined table, annotation information indicating the result and/or content of a calculation performed based on the table data of the joined table. FIGS. 5a and 5b show, together with annotation information, the joined table CT1 (a statement of changes in shareholders' equity) obtained by joining tables T11 and T12 shown in FIG. 2. Among the annotation information shown in FIGS. 5a and 5b, arrow A11 lies in the row whose row header cell contains the text "Balance at beginning of current period" and extends from a start cell (the "Capital stock" cell) to an end cell (the "Total shareholders' equity" cell). Arrow A11 indicates that the items between its start point and end point in the "Balance at beginning of current period" row are added together and that the sum is shown in the cell of the "Total shareholders' equity" column at the end point of arrow A11. However, because the "Balance at beginning of current period" row also contains subtotals, some items are excluded from the calculation to prevent double counting. Arrow A11 is drawn as a dotted line over the cells excluded from the calculation and as a solid line over the cells included in it. The result calculated by the calculation unit 21e over the range indicated by arrow A11 is "14,535,608", which matches the value in the "Total shareholders' equity" cell of the joined table CT1 shown in FIGS. 5a and 5b, so a match mark S2 is attached to the right of that cell.
Because the joined table CT1 represents a statement of changes in shareholders' equity, the "Total assets" at its end is the sum of "Total shareholders' equity" contained in table T11 and "Total valuation and translation adjustments" contained in table T12. The calculation based on the table data of the joined table CT1 is therefore performed across the table data of table T11 and the table data of table T12. In FIGS. 5a and 5b, arrow A12 indicates the calculation range for "Total assets", so that range must include the "Total shareholders' equity" value in table T11 and the "Total valuation and translation adjustments" value in table T12. For this reason, arrow A12, which starts at the "Capital stock" cell of table T11, does not terminate within table T11 but extends into table T12. One way to show that arrow A12 spans tables T11 and T12 (or that it does not terminate at the right edge of table T11) is, as shown in FIG. 5a, to draw arrow A12 as if it were cut off at the right edge of table T11 and to draw it as continuing from the corresponding position in table T12. Because arrows A11 and A12 are then displayed differently at the right edge of table T11, the user can easily tell which arrow terminates there and which does not. Alternatively, a predetermined continuation symbol (a circle, a square, or any other symbol distinguishable from the end of an arrow) may be attached to arrow A12 at the right edge of table T11, and the same or a corresponding continuation symbol may be attached at the corresponding position in table T12, where the display of arrow A12 resumes. In this case as well, the continuation symbol differs from the arrowhead that marks the end of an arrow, so the user can easily tell whether the arrow continues into the following table. In this way, even when one table is divided into a plurality of table portions in the original document 25b, the annotation information can correctly indicate a calculation range that spans those table portions.
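One possible way to represent an arrow that spans several table portions is sketched below. The function name `arrow_segments` and the dictionary layout are hypothetical; the sketch only shows how a single calculation range could be split into one drawable segment per table portion, with every segment except the last ending in a continuation symbol instead of an arrowhead.

```python
from itertools import groupby


def arrow_segments(cells: list[tuple[str, str]]) -> list[dict]:
    """Split one calculation range into per-table segments.

    cells is the ordered list of (table_id, cell_label) pairs covered by the arrow.
    """
    groups = [
        (table_id, [label for _, label in group])
        for table_id, group in groupby(cells, key=lambda cell: cell[0])
    ]
    last = len(groups) - 1
    return [
        {
            "table": table_id,
            "cells": labels,
            # Only the final segment ends with an arrowhead.
            "end": "arrowhead" if i == last else "continuation",
        }
        for i, (table_id, labels) in enumerate(groups)
    ]


# Range of arrow A12, reduced to its first and last cells in each table portion.
a12 = [
    ("T11", "Capital stock"),
    ("T11", "Total shareholders' equity"),
    ("T12", "Total valuation and translation adjustments"),
    ("T12", "Total assets"),
]
for segment in arrow_segments(a12):
    print(segment["table"], segment["end"])
```

An arrow confined to a single table portion, such as arrow A11, yields a single segment that ends with an arrowhead.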
The annotation information added to a joined table has been described above using a statement of changes in shareholders' equity as an example. For joined tables other than a statement of changes in shareholders' equity as well, the annotation information can correctly indicate a calculation range that spans the plurality of table portions contained in the joined table. For example, FIG. 3 shows a cash flow statement split across two pages. When tables T21 and T22, which are parts of this cash flow statement, are joined to create a joined table, the "Increase in cash and cash equivalents" at the end of the cash flow statement is the sum of "Cash flows from operating activities" and "Cash flows from investing activities" contained in table T21 and "Cash flows from financing activities" contained in table T22. The calculation based on the table data of this joined table is therefore performed across the table data in table T21 and the table data in table T22, and the annotation information can correctly indicate the range of the calculation performed across tables T21 and T22.
The digital document to which the annotation information has been added may be stored in the storage 25 as an annotated document 25c.
The annotation information explicitly described in this specification is merely an example of annotation information applicable to the invention disclosed herein, and the annotation information applicable to the invention is not limited to what is specifically described in this specification.
Although not shown in FIG. 1, the processor 21 of the server 20 can execute a function of accepting, from the user device 10, a request to send a digital document. The request may include identification information that identifies the digital document. Upon accepting a request to send a specific digital document, the processor 21 can read that digital document from the storage 25 and transmit it to the user device 10. As described above, the storage 25 may store the original document 25b, which is the original digital document, and the annotated document 25c, in which annotation information has been added to the original digital document. The request from the user device 10 may include document type identification information that specifies whether the original digital document or the digital document with annotation information is requested.
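The request handling described above can be sketched as a simple lookup. Everything in this sketch is an assumption made for illustration: the `DocType` enumeration, the dictionary standing in for the storage 25, the document identifier "doc-001", and the default of returning the original document when no type is specified.

```python
from enum import Enum
from typing import Optional


class DocType(Enum):
    ORIGINAL = "original"    # corresponds to the original document 25b
    ANNOTATED = "annotated"  # corresponds to the annotated document 25c


# Hypothetical in-memory stand-in for the storage 25.
storage = {
    ("doc-001", DocType.ORIGINAL): b"original bytes",
    ("doc-001", DocType.ANNOTATED): b"annotated bytes",
}


def handle_request(doc_id: str, doc_type: DocType = DocType.ORIGINAL) -> Optional[bytes]:
    """Return the requested variant of the document, or None if it is not stored."""
    return storage.get((doc_id, doc_type))


print(handle_request("doc-001", DocType.ANNOTATED))  # b'annotated bytes'
```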
Next, the recognition of cells in a table containing blank-delimited cells, that is, cells bounded by blank space on at least one of their top, bottom, left, and right sides, will be described. As an example of such a table, FIG. 6 shows a table T41 in which a region delimited by horizontal ruled lines contains twelve cells delimited by blank space. When the table data extraction unit 21b extracts table data from this kind of table by the image-based method, it converts, for example, the original document 25b containing table T41 (or the page of that document that contains table T41) into an image file and recognizes the table layout using the image file. For example, the table data extraction unit 21b detects ruled lines in the image file converted from the digital document and recognizes each rectangular region between ruled lines as a provisional cell. In the example of FIG. 6, ruled lines L1 to L4 are detected, and the regions between them are recognized as provisional cells T41a to T41c.
Next, the table data extraction unit 21b determines whether a provisional cell between ruled lines contains rows or columns that are delimited by blank space rather than by ruled lines. For example, when a provisional cell contains a plurality of texts that are spaced apart from one another in the horizontal direction, the table data extraction unit 21b can recognize that the provisional cell contains columns delimited by blank space rather than by ruled lines. Looking at the horizontal arrangement of text in provisional cell T41b, the text block "Buildings and structures", the text block "8,994 thousand yen", and the text block "44,049 thousand yen" are spaced apart from one another in the horizontal direction, so it is recognized that provisional cell T41b contains three columns delimited by blank space. Similarly, it is recognized that provisional cells T41a and T41c each contain three columns delimited by blank space.
Looking at the vertical arrangement of text in provisional cell T41b, the leftmost column inside provisional cell T41b contains the text blocks "Buildings and structures", "Machinery, equipment and vehicles", "Other tangible fixed assets", and "Software", which are spaced apart from one another in the vertical direction, so it is recognized that provisional cell T41b contains four rows delimited by blank space.
In this way, the table data extraction unit 21b recognizes that provisional cell T41b contains blank-delimited cells arranged in 4 rows by 3 columns. Similarly, it is recognized that provisional cells T41a and T41c each contain blank-delimited cells arranged in 1 row by 3 columns. The table data extraction unit 21b detects the table data placed in the blank-delimited cells recognized in this way by, for example, OCR.
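The recognition of blank-delimited rows and columns inside a provisional cell can be sketched as interval merging. This is a simplified illustration under stated assumptions: text blocks are given as bounding boxes `(x0, y0, x1, y1)`, which in practice would come from OCR or layout analysis, and the `min_gap` threshold and the function names are hypothetical. Projecting the boxes onto one axis and merging intervals that overlap or nearly touch leaves one band per column (or row); the blank space between bands is the delimiter.

```python
Box = tuple[float, float, float, float]  # (x0, y0, x1, y1) of one text block


def bands(intervals: list[tuple[float, float]], min_gap: float) -> list[tuple[float, float]]:
    """Merge 1-D intervals that overlap or lie within min_gap of each other."""
    merged: list[list[float]] = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= min_gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(lo, hi) for lo, hi in merged]


def grid_shape(boxes: list[Box], min_gap: float = 5.0) -> tuple[int, int]:
    """Return (rows, columns) of blank-delimited cells inside one provisional cell."""
    columns = bands([(x0, x1) for x0, _, x1, _ in boxes], min_gap)
    rows = bands([(y0, y1) for _, y0, _, y1 in boxes], min_gap)
    return len(rows), len(columns)


# Hypothetical layout: 4 lines of text, each with 3 horizontally separated blocks.
boxes = [
    (x0, y0, x1, y0 + 10)
    for y0 in (0, 20, 40, 60)
    for x0, x1 in ((0, 80), (120, 180), (220, 280))
]
print(grid_shape(boxes))  # (4, 3)
```

A real page would need a threshold tuned to the font size and would have to cope with multi-line cell text, which this sketch ignores.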
The calculation unit 21e can perform calculations based on the table data of the blank-delimited cells detected as described above. The calculation unit 21e can also perform calculations using both the table data of blank-delimited cells and the table data of cells delimited by ruled lines. The annotation addition unit 21f can add annotation information indicating the result and/or content of a calculation performed by the calculation unit 21e to table T41, which contains blank-delimited cells, and display it. In table T41 of FIG. 6, arrows indicating calculation ranges and a mismatch mark have been added.
In each of the tables shown in FIGS. 7 to 9, as in FIG. 6, blank-delimited cells exist within regions partitioned by ruled lines. The table data extraction unit 21b can recognize the blank-delimited cells in each of the tables shown in FIGS. 7 to 9 by the same technique used to detect the blank-delimited cells in FIG. 6. In each of the tables of FIGS. 7 and 8, annotation information indicating the result and/or content of a calculation on the table data in the blank-delimited cells has been added. In the table of FIG. 9, calculations are performed in both the vertical and horizontal directions, and annotation information indicating the result and/or content of those calculations has been added to the table.
Next, with reference to FIG. 10, a table joining process for joining two or more tables contained in a digital document will be described. First, in step S11, the tables contained in the digital document to be processed are detected. The processing of step S11 is performed by, for example, the table detection unit 21a described above.
Next, in step S12, table data is extracted by the text-based method from each of the tables detected in step S11, and in step S13, table data is extracted by the image-based method from each of the tables detected in step S11. Steps S12 and S13 may be performed in parallel, or step S13 may be performed before step S12. In step S13, cells that lie within a region partitioned by ruled lines but are not themselves partitioned by ruled lines (blank-delimited cells) may be detected, and the table data placed in those blank-delimited cells may be extracted. Either step S12 or step S13 may be omitted.
Next, in step S14, it is determined for each cell whether the table data extracted in step S12 and the table data extracted in step S13 overlap (a consistency check). If they overlap, the table data extracted by the image-based method in step S13 is adopted, and the adopted table data is stored in the storage 25 as the table data for that table. Accordingly, when calculation processing or annotation addition processing is performed using table data, the table data extracted by the image-based method is used. The table data extracted by the image-based method is given priority over the table data extracted by the text-based method because a table from which data is extracted by the image-based method has at least clearly defined lines enclosing it (vertical and horizontal outer borders), which makes the extraction of the table more accurate. When a table is enclosed by such lines, it can be separated more accurately from the text around it. In step S14, the same table is determined to have been extracted twice when, for example, the position in the digital document of a table detected in step S12 matches the position in the digital document of a table detected in step S13. For example, if both step S12 and step S13 find a table at the same or a nearby position on page 10 of the digital document and the two tables have the same or a similar size, the two tables are determined to be duplicates.
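The duplicate determination in step S14 can be sketched with a bounding-box comparison. The text above only requires "the same or a nearby position" and "the same or a similar size"; measuring this with intersection over union (IoU) and a threshold of 0.8 is an assumption made for this sketch, as are the `Region` type and the function names.

```python
from dataclasses import dataclass


@dataclass
class Region:
    page: int
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        return (self.x1 - self.x0) * (self.y1 - self.y0)


def iou(a: Region, b: Region) -> float:
    """Intersection over union of two boxes on the same page (assumed similarity measure)."""
    width = min(a.x1, b.x1) - max(a.x0, b.x0)
    height = min(a.y1, b.y1) - max(a.y0, b.y0)
    if width <= 0 or height <= 0:
        return 0.0
    intersection = width * height
    return intersection / (a.area + b.area - intersection)


def is_same_table(text_based: Region, image_based: Region, threshold: float = 0.8) -> bool:
    """Treat two detections as the same table if they share a page and overlap strongly."""
    return text_based.page == image_based.page and iou(text_based, image_based) >= threshold


# Hypothetical detections of one table on page 10 by the two methods.
from_text = Region(page=10, x0=50, y0=100, x1=500, y1=400)
from_image = Region(page=10, x0=52, y0=98, x1=498, y1=402)
print(is_same_table(from_text, from_image))  # True
```

A high IoU implies both a nearby position and a similar size, so a single threshold covers both conditions.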
In step S14, even when the table detected in step S12 and the table detected in step S13 are at the same position, if the numbers of detected cells differ, the table data with the larger number of detected cells may be adopted. For example, table T41 shown in FIG. 6 contains many blank-delimited cells. For a table with many blank-delimited cells, the image-based method, which detects cells on the basis of ruled lines, may fail to recognize some cells correctly even when blank-delimited cell detection is performed. Therefore, if more cells are detected in step S12, where table data is detected by the text-based method, than in step S13, where table data is detected by the image-based method, the table data detected by the text-based method may be adopted.
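Taken together, the two selection rules of step S14 can be sketched as follows. The function name and the representation of table data as a flat list of cell strings are hypothetical. The sketch prefers the image-based result by default and switches to the text-based result only when it found more cells.

```python
def choose_table_data(text_cells: list[str], image_cells: list[str]) -> tuple[str, list[str]]:
    """Pick the table data to keep when both methods detected the same table."""
    if len(text_cells) > len(image_cells):
        # The image-based method probably missed some blank-delimited cells.
        return "text", text_cells
    return "image", image_cells


# Hypothetical extraction results for a table with blank-delimited cells.
method, cells = choose_table_data(
    text_cells=["Buildings and structures", "8,994", "44,049"],
    image_cells=["Buildings and structures 8,994 44,049"],
)
print(method)  # text
```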
Next, in step S15, the table data of each table extracted in step S12 or S13 is stored in the storage 25 in association with the table identification information of that table. If it is determined in step S14 that the table data detected in step S12 and the table data detected in step S13 overlap, only the table data detected in step S13 may be stored in the storage 25 and the table data detected in step S12 may be discarded.
Next, in step S16, it is determined for each of the tables contained in the digital document to be processed whether it can be joined with another table. For example, in step S16, another table whose header information matches the header information of a given table contained in the digital document is selected as a join candidate. For example, for a given table among those detected in step S11, another table having row header information or column header information that matches at least one of the row header information and the column header information of the given table is selected as a join candidate table. In step S16, it is further determined whether to join the table with the selected join candidate. For example, when no text, or only predefined text, exists between a table contained in the digital document and the table selected as its join candidate, it is determined that the table is to be joined with the join candidate. The determination processing in step S16 may be performed by the determination unit 21c described above.
Alternatively, in step S16, when no text, or only predefined text, exists between any two of the tables detected in step S11, those two tables may be selected as a join candidate pair. In this case, it is determined whether at least one of the row header information and the column header information matches between the two tables of the pair, and if so, the pair is determined to be joinable.
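As a non-limiting illustration, the two conditions of step S16 (matching header information, and no text or only predefined text between the tables) might be sketched as follows. The class `TablePortion`, the function names, and the contents of `PREDEFINED_TEXT` are hypothetical and are not taken from the specification; exact equality is assumed as the meaning of "matching" header information.

```python
from dataclasses import dataclass

# Hypothetical example of predefined text permitted between two table portions.
PREDEFINED_TEXT = {"(continued)"}


@dataclass
class TablePortion:
    table_id: str
    column_headers: tuple
    row_headers: tuple


def headers_match(a, b):
    """True if the column headers or the row headers are non-empty and identical."""
    same_columns = bool(a.column_headers) and a.column_headers == b.column_headers
    same_rows = bool(a.row_headers) and a.row_headers == b.row_headers
    return same_columns or same_rows


def gap_allows_join(text_between):
    """True if nothing, or only predefined text, lies between the two portions."""
    lines = [line.strip() for line in text_between if line.strip()]
    return all(line in PREDEFINED_TEXT for line in lines)


def can_join(a, b, text_between):
    """Combine both conditions; the order in which they are checked does not matter."""
    return headers_match(a, b) and gap_allows_join(text_between)
```

Because both conditions must hold, the same function covers either order of evaluation described for step S16: selecting candidates by header first, or selecting pairs by the intervening text first.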
If no joinable tables are identified in step S16, the table joining process ends. If a set of joinable tables is identified in step S16, the two (or three or more) tables (table portions) determined to be joinable are joined in step S17. When two tables are joined, the resulting joined table is assigned a table identification number that identifies it, and the table data of the two tables before joining is stored in association with that table identification number. This joining process may be performed by the joining unit 21d described above. When all joinable tables have been joined in step S17, the table joining process ends.
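As a non-limiting illustration, joining two table portions and recording the pre-join table data under the new table identification number might be sketched as follows. The function name `join_tables`, the dictionary layout, and the "J1", "J2", ... numbering scheme are hypothetical; the sketch assumes that the portions share identical column headers and are joined by stacking their rows.

```python
def join_tables(first, second, storage):
    """Join two table portions that share column headers and store the result.

    first, second: dicts with "header" (list) and "rows" (list of lists).
    storage: dict acting as a stand-in for the storage 25.
    Returns the identification number assigned to the joined table.
    """
    if first["header"] != second["header"]:
        raise ValueError("column headers differ; the portions are not joinable")
    # Hypothetical numbering scheme: J1, J2, ... in order of creation.
    joined_id = f"J{len(storage) + 1}"
    storage[joined_id] = {
        "header": first["header"],
        "rows": first["rows"] + second["rows"],
        # The table data of the two source tables is kept with the joined id.
        "sources": [first, second],
    }
    return joined_id
```

Keeping the source tables alongside the joined table makes it possible to trace any cell of the joined table back to the portion it came from.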
Next, the flow of the annotation addition process, which adds to a table annotation information indicating the result and/or content of a calculation on the numerical values in that table, will be described with reference to FIG. 11. The annotation addition process shown in FIG. 11 assumes that, at the start of the process or at an appropriate point after it starts, the table data of the tables included in the digital document has been extracted and is available for calculation. First, in step S21, numerical data representing numerical values is extracted from the table data of a table included in the digital document being processed. Next, in step S22, a predetermined calculation is performed on the numerical values extracted in step S21. For example, if the digital document contains the balance sheet shown in table T31, the numerical values extracted in step S21 are calculated according to the rules expected for a balance sheet. For example, as shown in FIG. 4, the numerical values in the cells of the individual "current assets" items ("cash and deposits" and so on) in the rows between the row header "current assets" and the row header "total current assets" are added together to calculate the value corresponding to "total current assets". The calculation rules can be defined in advance. For example, a rule can be defined such that, when a row header contains the term "total", the numerical values in those rows preceding that row header which are presented in the table as a single group are added together. Applied to table T31, the six rows preceding the row header "total current assets" (the rows from "cash and deposits" to "allowance for doubtful accounts") are presented as a single group, so the rule adds the numerical values in these rows column by column.
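As a non-limiting illustration, the "total" rule of step S22 might be sketched as follows. The function name `compute_totals` and the row layout are hypothetical, and the numbers in the usage example are invented for illustration and are not the values of FIG. 4. The sketch assumes that a group consists of the consecutive numeric rows that follow the most recent heading row (a row without numbers) or the most recent total row.

```python
def compute_totals(rows, keyword="total"):
    """Sum, column by column, the group of rows preceding each total row.

    rows: list of (row_header, values) pairs, where values is a list of
    numbers, or an empty list for a heading row that carries no numbers.
    Returns {index of total row: [computed sum for each column]}.
    """
    results = {}
    group = []  # numeric rows collected since the last heading or total row
    for index, (header, values) in enumerate(rows):
        if keyword in header.lower():
            if group:
                results[index] = [sum(column) for column in zip(*group)]
            group = []  # a total row closes the current group
        elif values:
            group.append(values)
        else:
            group = []  # a heading row starts a new group
    return results


# Illustrative data only; the amounts are made up.
rows = [
    ("Current assets", []),
    ("Cash and deposits", [100, 120]),
    ("Inventories", [40, 30]),
    ("Allowance for doubtful accounts", [-5, -10]),
    ("Total current assets", [135, 150]),
]
totals = compute_totals(rows)
```

For a Japanese-language table the `keyword` argument would be set to the corresponding term, for example "合計".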
Next, in step S23, annotation information indicating the result and/or content of the calculation in step S22 is generated and displayed together with the table. For example, when the annotation addition process is performed on the balance sheet shown in FIG. 4, it is determined whether the "total current assets" value calculated in step S22 matches the numerical value in each cell of the "total current assets" row of the table, and the mismatch mark S1 or the match mark S2 is displayed according to the result. In the example shown in FIG. 4, the cell in the "total current assets" row of the "previous consolidated fiscal year (March 31, 2018)" column contains a numerical value that matches the calculation result of step S22, so the match mark S2 is displayed in or near that cell. Conversely, the cell in the "total current assets" row of the "current consolidated fiscal year (March 31, 2019)" column contains a numerical value that does not match the calculation result of step S22, so the mismatch mark S1 is displayed in or near that cell. The calculation ranges are indicated by arrows A1 to A10. The annotation addition process ends once the annotation information has been added to the table in this way.
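As a non-limiting illustration, the comparison of step S23 might be sketched as follows. The function name `annotate_totals` and the dictionary layout of each annotation are hypothetical; the marks are represented simply by the strings "S1" and "S2", and the amounts are assumed to be integers so that exact comparison is meaningful.

```python
MISMATCH_MARK = "S1"
MATCH_MARK = "S2"


def annotate_totals(stated, computed):
    """Return one annotation per column comparing stated and computed totals.

    stated: totals as printed in the table, one per column.
    computed: totals recalculated from the individual rows, one per column.
    """
    annotations = []
    for column, (s, c) in enumerate(zip(stated, computed)):
        annotations.append({
            "column": column,
            "stated": s,
            "computed": c,
            "mark": MATCH_MARK if s == c else MISMATCH_MARK,
        })
    return annotations
```

A display layer could then place each mark in or near the corresponding cell and draw the calculation range from the rows that contributed to `computed`.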
The annotation addition process may be performed on each table included in the digital document, or on a joined table generated by joining two or more tables included in the digital document. In other words, the annotation addition process disclosed herein does not presuppose that the tables in the digital document are joined. Accordingly, although the table joining system 1 is referred to as a "table joining" system for convenience of explanation, it does not necessarily join tables. For example, when the table joining system 1 shown in FIG. 1 performs the annotation addition process without joining the tables included in the digital document, it need not include the function of at least one of the determination unit 21c and the joining unit 21d. In this case, the table joining system 1 has no function for joining the tables included in the digital document, but can still perform the annotation addition process on those tables.
Next, another aspect of the invention disclosed herein will be described with reference to FIGS. 12 and 13. In addition to, or instead of, the function of joining tables included in a digital document, the processor 21 of the table joining system 1 may function as a table dividing unit (not shown) that divides a table included in the digital document. For example, as shown in FIG. 12, the table dividing unit may determine whether a repeating pattern exists in the column header information of an original table that is included in the digital document as a single table and, if so, divide the original table into sub-tables, one per repetition. For example, in the table T51 shown in FIG. 12, C1 to C5 appear three times in succession as column header information. In this case, the processor 21 may divide the table horizontally into three tables (sub-tables) T51-1 to T51-3, using as the cutting position the ruled line L11 between the column header cell containing C5, which is the end point of one repetition, and the column header cell containing C1, which is the start point of the next repetition. Each of the sub-tables T51-1 to T51-3 obtained in this way contains exactly one unit repeating pattern consisting of the five columns corresponding to the column headers C1 to C5. Although FIG. 12 illustrates a repeating pattern in the column header information, the repeating pattern may instead appear in the row header information. In that case, the original table is divided vertically, one sub-table per repetition.
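As a non-limiting illustration, detecting a repeating pattern in a single row of column headers and dividing the table at each repetition might be sketched as follows. The function names `repeating_unit` and `split_by_repetition` are hypothetical. The sketch assumes that `headers` covers only the repeated columns and that there is a single header row; grouped header rows such as GC2 and GC3 in FIG. 13 are not handled.

```python
def repeating_unit(headers):
    """Return the length of the shortest unit that repeats two or more times
    to produce `headers` exactly, or None if there is no such unit."""
    n = len(headers)
    for p in range(1, n // 2 + 1):
        if n % p == 0 and headers == headers[:p] * (n // p):
            return p
    return None


def split_by_repetition(headers, rows):
    """Divide a table into sub-tables, one per repetition of the header unit.

    headers: list of column header strings.
    rows: list of rows, each a list with one cell per column.
    Returns a list of (sub_headers, sub_rows) pairs; a table without a
    repeating pattern is returned unchanged as a single pair.
    """
    p = repeating_unit(headers)
    if p is None:
        return [(headers, rows)]
    return [
        (headers[k:k + p], [row[k:k + p] for row in rows])
        for k in range(0, len(headers), p)
    ]
```

Applied to fifteen columns headed C1 to C5 three times over, the sketch returns three sub-tables of five columns each; the same functions can be applied to row headers after transposing the table.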
Each of the divided sub-tables may be given table identification information that identifies it. The table data extraction unit 21b may extract table data for each sub-table and store the extracted table data in the storage 25 for each sub-table. The table data of each sub-table may be stored in the storage 25 in association with the table identification information identifying that sub-table. The calculation unit 21e can perform calculations for each sub-table based on the numerical values included in its table data. The annotation addition unit 21f can, for each sub-table, add annotation information indicating the result and/or content of the calculation performed by the calculation unit 21e to that sub-table and display it.
In the table T61 shown in FIG. 13, on the other hand, C1 to C5 are repeated twice, but the first C3 counting from the left is located at the head (leftmost position) of the sub-columns included in the GC2 column, whereas the second C3 counting from the left is located at the right end of the sub-columns included in the column whose header information is GC3. Table T61 is therefore determined not to have a divisible repeating pattern.
The repeating pattern that serves as the unit for dividing a table can be detected by analyzing the text contained in the column header cells or row header cells. For example, the balance sheet shown in FIG. 14 includes, under each of the "Assets" and "Liabilities" sections, a column header cell containing the text "Account" and a column header cell containing the text "Amount". The table dividing unit can therefore determine that the "Account" column header cell and the "Amount" column header cell are repeated in the balance sheet. Based on this determination, the table dividing unit can divide the balance sheet into a sub-table consisting of the columns of the "Assets" section and a sub-table consisting of the columns of the "Liabilities" section.
The processor 21 may also have a function of estimating, for each cell constituting a table included in the digital document, whether the cell is a header cell containing header information or a table data cell containing table data. The processor 21 may further have a function of dividing a table into two sub-tables at the boundary between the first data column and the second header column when the table includes, in this order, a first header column containing only header cells, a first data column containing only table data cells, a second header column containing only header cells, and a second data column containing only table data cells. Refer again to FIG. 14. In the balance sheet of FIG. 14, the "Assets" section is on the left and the "Liabilities" section is on the right. The "Assets" section is divided into an "Account" column containing only header information and an "Amount" column containing only the content (data) for that header information. Likewise, the "Liabilities" section is divided into an "Account" column containing only header information and an "Amount" column containing only the content (data) for that header information. The balance sheet can therefore be divided into two sub-tables at the boundary between the "Amount" column of the "Assets" section and the "Account" column of the "Liabilities" section. In this way, a single table included in a digital document can be divided into two or more sub-tables by a method other than recognizing a repeating pattern.
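As a non-limiting illustration, dividing a table whose columns alternate between header cells and table data cells might be sketched as follows. The specification does not state how a cell is estimated to be a header cell or a table data cell; the numeric-parsing heuristic in `is_data_cell` is purely an assumption, and all function names are hypothetical. The sketch handles only a table of exactly four columns.

```python
def is_data_cell(cell):
    """Assumed heuristic: a cell is table data if its text parses as a number."""
    text = str(cell).replace(",", "").strip()
    try:
        float(text)
        return True
    except ValueError:
        return False


def column_kind(column):
    """Classify a column as "header", "data", or "mixed", ignoring blank cells."""
    kinds = {is_data_cell(cell) for cell in column if str(cell).strip()}
    if kinds == {True}:
        return "data"
    if kinds == {False}:
        return "header"
    return "mixed"


def split_header_data(rows):
    """Divide a four-column table laid out as header, data, header, data.

    Returns (left_rows, right_rows), cut between the first data column and
    the second header column, or None if the layout does not apply.
    """
    kinds = [column_kind(column) for column in zip(*rows)]
    if kinds == ["header", "data", "header", "data"]:
        return [row[:2] for row in rows], [row[2:] for row in rows]
    return None
```

The column-level check mirrors the condition that each header column contains only header cells and each data column contains only table data cells; a single misclassified cell makes the column "mixed" and the table is left undivided.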
Even where the processes and procedures described herein are described as being executed by a single device, piece of software, component, or module, they may be executed by multiple devices, multiple pieces of software, multiple components, and/or multiple modules. Likewise, even where the data, tables, or databases described herein are described as being stored in a single storage device (storage or memory), they may be stored in a distributed manner across multiple storage devices provided in a single device or across multiple storage devices distributed over multiple devices. Furthermore, the software and hardware elements described herein may be implemented by integrating them into fewer components or by breaking them down into more components.
Notations such as "first", "second", and "third" in this specification are used to distinguish components and do not necessarily limit their number, order, or content.
A component expressed in the singular in this specification includes the plural unless doing so would cause a contradiction.
Cross-reference
This application claims priority based on Japanese Patent Application No. 2020-196428 and Japanese Patent Application No. 2020-196431, both filed on November 26, 2020, the entire contents of which are incorporated herein by reference.
1 Table joining system
10 User device
20 Server
Claims (17)
1. A table joining program that causes one or more processors to execute:
a table detection function of detecting, from a digital document, a first table portion and a second table portion different from the first table portion;
a determination function of determining whether the first table portion and the second table portion can be joined; and
a function of joining the first table portion and the second table portion when it is determined that the first table portion and the second table portion can be joined.
2. The table joining program according to claim 1, wherein the first table portion is extracted from a first page, and the second table portion is extracted from a second page different from the first page.
3. The table joining program according to claim 1 or 2, further causing the one or more processors to execute a function of performing a calculation based on first numerical data included in the first table portion and second numerical data included in the second table portion.
4. The table joining program according to claim 3, further causing the one or more processors to execute a function of adding, to the digital document, annotation information indicating at least one of a result of the calculation based on the first numerical data and the second numerical data and the content of the calculation.
5. The table joining program according to any one of claims 1 to 4, wherein the first table portion and the second table portion are determined to be joinable when first header information of the first table portion and second header information of the second table portion are identical.
6. The table joining program according to any one of claims 1 to 5, wherein the first table portion and the second table portion are determined to be joinable when no text exists between the first table portion and the second table portion in the digital document, or when only predetermined predefined text exists between the first table portion and the second table portion.
7. The table joining program according to any one of claims 1 to 6, wherein the digital document is an unstructured document.
8. An annotation addition program that causes one or more processors to execute:
a function of extracting numerical values from an unstructured document;
a function of performing a calculation based on the extracted numerical values; and
a function of adding, to the unstructured document, annotation information indicating at least one of a result of the calculation and the content of the calculation.
9. A table dividing program that causes one or more processors to execute:
a function of analyzing a pattern of cells constituting a table included in a digital document; and
a function of dividing the table into sub-tables, one per repetition, when a repeating pattern appears in any column or row of the table.
10. A table dividing program that causes one or more processors to execute:
a function of estimating whether each cell constituting a table included in a digital document corresponds to a header element or a table data element; and
a function of dividing the table into two sub-tables at the boundary between a first data column and a second header column when the table includes, in this order, a first header column containing only cells corresponding to the header element, the first data column containing only cells corresponding to the table data element, the second header column containing only cells corresponding to the header element, and a second data column containing only cells corresponding to the table data element.
11. A table joining system comprising one or more processors, wherein the one or more processors, by executing computer-readable instructions:
detect, from a digital document, a first table portion and a second table portion different from the first table portion;
determine whether the first table portion and the second table portion can be joined; and
join the first table portion and the second table portion when it is determined that the first table portion and the second table portion can be joined.
12. A table joining method performed by one or more computer processors executing computer-readable instructions, the method comprising:
a step of detecting, from a digital document, a first table portion and a second table portion different from the first table portion;
a step of determining whether the first table portion and the second table portion can be joined; and
a step of joining the first table portion and the second table portion when it is determined that the first table portion and the second table portion can be joined.
13. A user device comprising one or more processors, wherein the one or more processors, by executing computer-readable instructions:
upload a digital document to a server;
acquire, from the server, an annotated document to which annotation information has been added, the annotation information indicating at least one of a result of a calculation based on numerical values extracted from the digital document and the content of the calculation; and
display the annotated document.
14. An annotation addition method performed by one or more computer processors executing computer-readable instructions, the method comprising:
a step of extracting numerical values from an unstructured document;
a step of performing a calculation based on the extracted numerical values; and
a step of adding, to the unstructured document, annotation information indicating at least one of a result of the calculation and the content of the calculation.
15. A table dividing method performed by one or more computer processors executing computer-readable instructions, the method comprising:
a step of analyzing a pattern of cells constituting a table included in a digital document; and
a step of dividing the table into sub-tables, one per repetition, when a repeating pattern appears in any column or row of the table.
16. A table dividing method performed by one or more computer processors executing computer-readable instructions, the method comprising:
a step of estimating whether each cell constituting a table included in a digital document corresponds to a header element or a table data element; and
a step of dividing the table into two sub-tables at the boundary between a first data column and a second header column when the table includes, in this order, a first header column containing only cells corresponding to the header element, the first data column containing only cells corresponding to the table data element, the second header column containing only cells corresponding to the header element, and a second data column containing only cells corresponding to the table data element.
17. A method of creating an electronically annotated unstructured document, the method comprising:
a step of acquiring an unstructured document;
a step of detecting, from the unstructured document, a first table portion and a second table portion different from the first table portion;
a step of determining whether the first table portion and the second table portion can be joined;
a step of joining the first table portion and the second table portion when it is determined that the first table portion and the second table portion can be joined;
a step of performing a calculation based on first numerical data included in the first table portion and second numerical data included in the second table portion; and
a step of generating the annotated unstructured document by adding, to the unstructured document, annotation information indicating at least one of a result of the calculation based on the first numerical data and the second numerical data and the content of the calculation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2022565026A JPWO2022113378A1 (en) | 2020-11-26 | 2020-12-25 |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2020-196428 | 2020-11-26 | ||
JP2020196431 | 2020-11-26 | ||
JP2020196428 | 2020-11-26 | ||
JP2020-196431 | 2020-11-26 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022113378A1 true WO2022113378A1 (en) | 2022-06-02 |
Family
ID=81754141
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2020/048664 WO2022113378A1 (en) | 2020-11-26 | 2020-12-25 | Table combining program, table combining system, and table combining method |
Country Status (2)
Country | Link |
---|---|
JP (1) | JPWO2022113378A1 (en) |
WO (1) | WO2022113378A1 (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS6383861A (en) * | 1986-09-27 | 1988-04-14 | Fuji Xerox Co Ltd | Automatic processing system for numerical formula |
JPH06318201A (en) * | 1993-05-07 | 1994-11-15 | Canon Inc | Table dividing method |
JP2020155054A (en) * | 2019-03-22 | 2020-09-24 | 三菱重工業株式会社 | Table information reading device, table information reading method and program |
JP2020177425A (en) * | 2019-04-17 | 2020-10-29 | 富士ゼロックス株式会社 | Information processor and program |
-
2020
- 2020-12-25 WO PCT/JP2020/048664 patent/WO2022113378A1/en active Application Filing
- 2020-12-25 JP JP2022565026A patent/JPWO2022113378A1/ja active Pending
Also Published As
Publication number | Publication date |
---|---|
JPWO2022113378A1 (en) | 2022-06-02 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20963641; Country of ref document: EP; Kind code of ref document: A1 |
| ENP | Entry into the national phase | Ref document number: 2022565026; Country of ref document: JP; Kind code of ref document: A |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 20963641; Country of ref document: EP; Kind code of ref document: A1 |