WO2024084539A1 - Table recognition device and method - Google Patents
Table recognition device and method
- Publication number
- WO2024084539A1 (PCT/JP2022/038526)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- ruled
- character string
- string
- character
- ruled line
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/412—Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
Definitions
- the present disclosure relates to a table recognition device and method for performing character recognition from image information of a table-format document.
- the table's borders are extracted and the rows and columns of the table are separated into multiple areas (multiple border frames) by the borders. Character recognition is then performed for each border frame, and the character recognition results are stored separately. For this reason, if a single item name or item value is written across multiple border frames, such as when the character string within a border frame that is to become an item name or item value exceeds the border frame and spills over into an adjacent border frame, the item name or item value (character string) may not be recognized.
- adjacent ruled frames may have different types of lines, and the ruled frame information of adjacent ruled frames may differ. In such cases the preset conditions are not met, so a character string that is written across multiple ruled frames that do not meet the conditions cannot be recognized.
- the present disclosure has been made to solve the above-mentioned problems, and aims to make it possible to accurately recognize character strings that span multiple ruled frames without relying on ruled frame information.
- the table recognition device of the present disclosure recognizes character strings written in a table-format document from image information of that document. It includes a character recognition unit that recognizes the character string written within each of a plurality of ruled frames provided in the table-format document, and a ruled line frame integration determination unit. For a target ruled frame among the plurality of ruled frames, the determination unit considers a single character string, which is the character string recognized for the target ruled frame, and a concatenated character string, which is the single character string concatenated with a character string recognized for a ruled frame other than the target ruled frame. Whichever of the two has the higher degree of match with a matching character string that should be written in the table-format document is determined to be the integrated character string belonging to the target ruled frame.
- the table recognition method of the present disclosure recognizes character strings written in a table-format document from image information of that document. It comprises a step of recognizing the character string written within each of a plurality of ruled frames provided in the table-format document, and a step of determining an integrated character string for a target ruled frame among the plurality of ruled frames. The candidates are a single character string, which is the character string recognized for the target ruled frame, and a concatenated character string, which is the single character string concatenated with a character string recognized for a ruled frame other than the target ruled frame. The candidate that matches the matching character string that should be written in the table-format document is determined to be the integrated character string belonging to the target ruled frame.
- FIG. 1 is a functional configuration diagram illustrating a configuration of a table recognition device according to a first embodiment.
- FIG. 2 is a diagram illustrating an example of a table-format document to be recognized by the table recognition device in the first embodiment.
- FIG. 3 is a diagram illustrating an example of a knowledge database according to the first embodiment.
- FIG. 4 is a hardware configuration diagram of the table recognition device according to the first embodiment.
- FIG. 5 is a flowchart showing an operation sequence of the table recognition device in the first embodiment.
- FIG. 6 is a flowchart illustrating an operation sequence of a ruled line frame integration determination unit according to the first embodiment.
- FIG. 7 is a diagram illustrating an example of the operation of the table recognition device in the first embodiment.
- FIG. 11 is a functional configuration diagram showing the configuration of a table recognition device in embodiment 2.
- FIG. 13 is a diagram illustrating an example of a table structure knowledge database in the second embodiment.
- 13 is a flowchart showing an operation sequence of a ruled line integration determination unit in the second embodiment.
- 13 is a diagram illustrating an example of the operation of the table recognition device in embodiment 3.
- Fig. 1 is a functional block diagram showing the configuration of a table recognition device 100 in the embodiment 1.
- the table recognition device 100 is made up of a table structure recognition unit 1, a character recognition unit 2, a ruled line frame integration determination unit 3, and a knowledge database 4.
- FIG. 2 is a diagram showing an example of a tabular document to be recognized by the table recognition device 100 in the first embodiment.
- in the first row of the table, the item name "Item A" is written in the first column, "Item B" in the second column, and "Item C" in the third column.
- the second row and subsequent rows of the table contain item values belonging to each item name.
- characters may protrude from the ruled frame or may touch the ruled frame.
- the characters "at" of the item value "Total Fat" belonging to the item name "Item A" protrude into the ruled frame of the item name "Item B". Also, the initial character "S" of the item value "Saturated Fat" belonging to the item name "Item B" touches the ruled frame.
- the table structure recognition unit 1 extracts lines from the image information of a table-format document and recognizes the table structure.
- the table structure is made up of multiple areas (i.e., multiple ruled frames) that separate the rows and columns of the table with lines.
- the method of recognizing the table structure can be, for example, a method based on edge histograms. Specifically, from the image information of a tabular document, edges in two directions are obtained near the boundaries of white pixel clusters inside the table area.
- a white pixel cluster is a white area surrounded by a border frame of a color other than white. Then, based on the edge histograms obtained from the edges in each of the two directions, partial information on the borders is obtained. Furthermore, based on this partial information, border information for the table area is obtained and the table structure is recognized.
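As a rough illustration of how histograms can expose ruled lines, the sketch below uses a plain projection histogram, which is a much simpler stand-in and not the edge-histogram method described above. The function name `find_ruled_lines`, the binary image layout (1 = dark pixel), and the `min_fraction` parameter are all assumptions made for this example.

```python
def find_ruled_lines(image, min_fraction=0.9):
    """Return (row indices, column indices) that look like ruled lines.

    image: list of equal-length rows, 1 = dark pixel, 0 = white pixel.
    A row (column) counts as a ruled line when at least `min_fraction`
    of its pixels are dark. This is a simplification for illustration.
    """
    height, width = len(image), len(image[0])
    # Histogram of dark pixels per row and per column.
    row_hist = [sum(row) for row in image]
    col_hist = [sum(image[y][x] for y in range(height)) for x in range(width)]
    rows = [y for y, count in enumerate(row_hist) if count >= min_fraction * width]
    cols = [x for x, count in enumerate(col_hist) if count >= min_fraction * height]
    return rows, cols


# A tiny 5x7 "table": outer border plus one vertical line in the middle.
tiny = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1],
]
rows, cols = find_ruled_lines(tiny)
```

Consecutive pairs of detected rows and columns then delimit the ruled frames (here, one row of two frames).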
- the method of recognizing the table structure is not limited to this. For example, various methods can be used as long as they can obtain information on the row and column structure of a table.
- the character recognition unit 2 recognizes the character string within the ruled frame, for example, using optical character recognition (OCR) technology. Note that the method for recognizing the character string within the ruled frame is not limited to OCR, and other methods may also be used.
- the ruled line frame integration determination unit 3 determines which ruled line frames should be integrated according to the degree of matching between the character string recognized by the character recognition unit 2 and the matching character string registered in the knowledge database 4. In other words, it determines whether the character string in each ruled line frame should be linked to the character strings in the adjacent ruled line frames on the left and right.
- the degree of matching can be a likelihood.
- the likelihood is a value that indicates the "likelihood" that an arbitrary character string is estimated to belong to a certain character string group.
- the likelihood can be derived from the standardized edit distance between two character strings.
- the standardized edit distance is a value obtained by dividing the edit distance by the length of the longer character string.
- the edit distance is the minimum number of operations required to transform one character string into the other character string by inserting, deleting, or replacing one character.
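The two definitions above can be sketched in a few lines. The function names are chosen for this example only, and returning 0.0 for two empty strings is an assumption, since the text does not cover that case.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of one-character insertions, deletions, or
    replacements needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace (or keep) ca
        prev = curr
    return prev[-1]


def standardized_edit_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer string."""
    longer = max(len(a), len(b))
    return edit_distance(a, b) / longer if longer else 0.0  # assumption for "" vs ""
```

For instance, `standardized_edit_distance("Total F", "Total Fat")` divides an edit distance of 2 by the longer length 9.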
- the likelihood may be read as the degree of matching between two character strings. Then, based on the determination result, the character strings within the ruled frames are concatenated, and either the concatenated character string or the single (unconcatenated) character string is output as an integrated character string, which is the final integration determination result.
- the knowledge database 4 defines the item name and the matching string, which is a string to be written as the item value belonging to the item name.
- FIG. 3 shows an example of the contents of the matching string in the knowledge database 4.
- the matching string is not limited to one word, but may be a phrase or a sentence consisting of multiple words.
- the IDX in FIG. 3 is an index number that is individually assigned to each matching string. This IDX is used to specify the matching string when referring to the matching string in the knowledge database 4. Specifically, in FIG. 3, index numbers A1, A2, ..., A9 are assigned in order to the item values belonging to the item name "Item A".
- the matching string is not limited to the string shown in FIG. 3, and can be set arbitrarily.
- a so-called wildcard representing any character or number can be set for an item value in the knowledge database 4.
- for example, [*NUM*] denotes a wildcard for which any number can be substituted.
- with wildcards, the amount of calculation required to refer to a matching character string for each character or number can be reduced.
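The text does not say how a wildcard entry is compared with a recognized string. One possible approach, shown here purely as an assumption, is to mask every number in the recognized string with the wildcard token before comparing it with the registered entry. The helper names `mask_numbers` and `matches_wildcard` are hypothetical.

```python
import re

NUM_WILDCARD = "[*NUM*]"  # wildcard token as written in the knowledge database


def mask_numbers(recognized: str) -> str:
    """Replace each number (digits with an optional decimal part)
    in a recognized string with the wildcard token."""
    return re.sub(r"\d+(?:\.\d+)?", NUM_WILDCARD, recognized)


def matches_wildcard(recognized: str, entry: str) -> bool:
    """True if the recognized string equals the entry once numbers are masked."""
    return mask_numbers(recognized) == entry
```

With this masking, a single entry such as "[*NUM*]g" covers "25g", "9g", and every other quantity in grams, so individual numbers do not have to be registered or compared one by one.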
- the recognized character string may be replaced with a character string of an item name or an item value registered in the knowledge database 4.
- a character string of an item name or an item value that is not registered may be excluded from the recognition target as a result of erroneous recognition. Specific conditions for replacement will be described later.
- Fig. 4 is a hardware configuration diagram of the table recognition device 100.
- the table recognition device 100 has a processor 101, a memory 102, an external storage device 103, and an input/output interface 104.
- the processor 101 controls the entire table recognition device 100.
- the processor 101 is a CPU (Central Processing Unit) or an FPGA (Field Programmable Gate Array).
- the processor 101 may be a multiprocessor.
- the table recognition device 100 may also have a processing circuit.
- the memory 102 is the main storage device of the table recognition device 100.
- the memory 102 is a RAM (Random Access Memory).
- the external storage device 103 is an auxiliary storage device of the table recognition device 100.
- the external storage device 103 is a HDD (Hard Disk Drive) or an SSD (Solid State Drive).
- the input/output interface 104 is an interface that transmits and receives data to and from an external device connected to the table recognition device 100.
- the input/output interface 104 is a NIC (Network Interface Controller).
- the external device is an image scanner, a display, etc. Note that illustration of the external device is omitted.
- the processor 101 reads the table recognition program stored in the external storage device 103 into the memory 102, and the processor 101 executes the program, thereby realizing each process of the table recognition method.
- the external storage device 103 holds the program and data for realizing the table recognition method of the first embodiment.
- the table recognition program may be provided over a network, or may be provided by being recorded on a computer-readable recording medium. That is, the table recognition program may be provided, for example, as a program product.
- the input/output interface 104 receives image information of a tabular document from an external device such as an image scanner, and outputs the table recognition results to an external device such as a display.
- FIG. 5 is a flowchart showing the operation sequence of table recognition device 100 in embodiment 1.
- leaving aside the contents of the first row of a table (i.e., the row in which the item names are written), a method for integrating only the ruled frames of item values from the second row onwards will be described.
- for this purpose, the second row, which is the first row of item values, will be treated as the new first row in the explanation.
- in step S1, the table structure recognition unit 1 extracts ruled lines from the image information of the tabular document and recognizes the table structure consisting of multiple ruled line frames. Furthermore, the table structure recognition unit 1 obtains the number of rows in the table of the tabular document and the number of columns in each row from the recognized table structure.
- in step S2, the character recognition unit 2 recognizes the characters within each ruled frame that the table structure recognition unit 1 recognized in step S1.
- in step S3, the ruled line frame integration determination unit 3 refers to the knowledge database 4 and determines which ruled line frames to integrate based on the character recognition results obtained in step S2. Then, based on the determination result, it either concatenates the character strings in adjacent ruled line frames and outputs the concatenated character string, or outputs the single, unconcatenated character string.
- FIG. 6 is a flowchart showing the sequence of operations performed by the ruled frame integration determination unit 3 in step S3.
- the "←" in the diagram represents the process of assigning the value or element on the right side to the variable on the left side.
- the value held by the variable may be indicated by the symbolic name of the variable.
- the operations are performed for a table with the upper left corner as the origin.
- in step S301, the variable i, which indicates the position of a row in the table, is assigned the value 1.
- in step S302, it is confirmed whether the variable i is equal to or less than the number of rows in the table. If the variable i is equal to or less than the number of rows in the table (Yes in step S302), the process proceeds to step S303. If the variable i is greater than the number of rows in the table (No in step S302), the process ends (END) since all rows have been evaluated.
- in step S303, 1 is assigned to the variable j, which represents the position of the column in the table.
- in step S304, it is confirmed whether variable j is less than or equal to the total number of ruled frames for item values belonging to row i (hereinafter, the number of items). If variable j is less than or equal to the number of items (Yes in step S304), the process proceeds to step S305. If variable j exceeds the number of items (No in step S304), the process proceeds to step S316.
- in step S305, it is confirmed whether there is a character recognition result within the ruled frame in the jth column of the ith row. If there is one (Yes in step S305), the process proceeds to step S306. If there is none (No in step S305), the process proceeds to step S315.
- in step S306, 0 is assigned to the variable k.
- in step S307, it is confirmed whether the value of variable k is less than or equal to the number of items minus the value of variable j minus 1 (i.e., k ≤ number of items − j − 1). If so (Yes in step S307), the process proceeds to step S308. Otherwise (No in step S307), the process proceeds to step S313.
- in step S308, it is confirmed whether there is a character recognition result within the ruled frame in the (j+k)th column of the ith row. If there is one (Yes in step S308), the process proceeds to step S309. If there is none (No in step S308), the process proceeds to step S313.
- in step S309, the ruled frames from the jth column to the (j+k)th column are integrated, and the character strings in the ruled frames are concatenated to obtain a concatenated character string.
- the knowledge database 4 is then referred to, the matching character strings belonging to the jth item name (i.e., the jth column) in the knowledge database 4 are sequentially read out, the likelihood L[j+k,j] that the concatenated character string belongs to the item name in the jth column is calculated, and the likelihood L[j+k,j] is assigned to the variable L1.
- the likelihood L[j+k,j] is calculated, for example, as follows. First, one matching string is read out from the one or more matching strings registered in the knowledge database 4 as item values belonging to the j-th item name (i.e., the j-th column), using the index number IDX as a key. Next, the standardized edit distance NED is calculated between the concatenated string obtained by concatenating the strings in the ruled frames from the j-th column to the (j+k)th column and the matching string read out using the index number IDX as a key. The standardized edit distance NED is calculated for each matching string registered in the knowledge database 4. It may be calculated for all matching strings registered in the knowledge database 4, or only for some of them.
- next, the minimum value NED_MIN is obtained from the one or more calculated standardized edit distances NED. The value obtained by subtracting NED_MIN from 1 is then used as the likelihood L[j+k,j].
- the likelihood thus represents the degree of match between the concatenated string and the most similar of the matching strings registered in the knowledge database 4. When the degree of match between two character strings is high, few edit operations are needed to transform one into the other, and the standardized edit distance is small.
- in that case, the likelihood L[j+k,j] takes a high value (i.e., a value close to 1).
- conversely, when the degree of match is low, the likelihood L[j+k,j] takes a low value (i.e., a value close to 0).
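The calculation described above can be summarized in one line. The symbols ED (edit distance), s_{j..j+k} (the string concatenated from columns j to j+k), and M_j (the set of matching strings evaluated for the j-th item name) are notation introduced here for convenience and do not appear in the original text.

```latex
\[
\mathrm{NED}(s, m) = \frac{\mathrm{ED}(s, m)}{\max(|s|, |m|)}, \qquad
\mathrm{NED}_{\mathrm{MIN}} = \min_{m \in M_j} \mathrm{NED}\bigl(s_{j..j+k},\, m\bigr), \qquad
L[j+k,\, j] = 1 - \mathrm{NED}_{\mathrm{MIN}}
\]
```

Because the edit distance never exceeds the length of the longer string, NED lies between 0 and 1, and so does the likelihood.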
- the likelihood may be calculated based on the output of a trained model trained on the knowledge database 4 using a known machine learning method such as a DNN (Deep Neural Network).
- the trained model can be created from string data obtained from a large number of tabular documents. Specifically, multiple strings are randomly extracted from this string data and concatenated to generate a concatenated string. Next, the likelihood (e.g., based on the standardized edit distance) between the input string data, which is the generated concatenated string, and the matching strings registered in the knowledge database 4 is calculated. The likelihood corresponding to each input string is then assigned as a correct answer label (or ranking) and used as training data.
- the trained model can be created by machine learning using the input string data and the training data so that the estimated likelihood, which is the output of the trained model, matches the correct answer label. Since the estimated likelihood is output by inputting the concatenated string into the trained model, the likelihood can be directly calculated without referring to the knowledge database 4 or without using the knowledge database 4. This is particularly effective when there is a large number of matching strings registered in the knowledge database 4, and it can reduce the amount of calculation required to refer to the knowledge database 4 and the amount of memory required to store the matching strings in the knowledge database 4.
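The label-generation step can be sketched as follows. This covers only the creation of (input string, correct answer label) pairs and not the model itself. The function `make_training_pairs`, its parameters, and the choice to concatenate two or three strings without a separator are assumptions for illustration.

```python
import random


def edit_distance(a, b):
    """Minimum number of one-character insertions, deletions, or replacements."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def likelihood(text, matching_strings):
    """1 minus the smallest standardized edit distance to any matching string."""
    return 1.0 - min(
        edit_distance(text, m) / max(len(text), len(m), 1) for m in matching_strings
    )


def make_training_pairs(corpus, matching_strings, n_pairs, max_parts=3, seed=0):
    """Randomly concatenate strings from `corpus` and label each result
    with its likelihood against the registered matching strings."""
    rng = random.Random(seed)  # seeded so that the data set is reproducible
    pairs = []
    for _ in range(n_pairs):
        parts = rng.choices(corpus, k=rng.randint(2, max_parts))
        text = "".join(parts)
        pairs.append((text, likelihood(text, matching_strings)))
    return pairs
```

A regression model trained on such pairs would then approximate the likelihood directly from the input string.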
- in step S310, in order to compare the concatenated string obtained in step S309 with a longer concatenated string that also includes the string in the next adjacent ruled frame, the ruled frames from column j to column (j+k+1) are integrated and the strings in those ruled frames are concatenated. Then, as in step S309, the knowledge database 4 is referenced to calculate the likelihood L[j+k+1,j] that this concatenated string belongs to the jth item name, and the likelihood L[j+k+1,j] is assigned to variable L2.
- in step S311, it is confirmed whether the value of variable L1 is less than or equal to the value of variable L2. If so (Yes in step S311), concatenating one more string gives a likelihood at least as high as that of the concatenated string obtained in step S309, so the process proceeds to step S312. If the value of L1 is greater than the value of L2 (No in step S311), the process proceeds to step S313.
- in step S312, 1 is added to the variable k, and the process returns to step S307.
- in step S313, it is confirmed whether the likelihood L[j+k,j] calculated in step S309 is equal to or greater than a predetermined threshold T1. If the likelihood L[j+k,j] is equal to or greater than the predetermined threshold T1 (Yes in step S313), the process proceeds to step S314. If the likelihood L[j+k,j] is less than the predetermined threshold T1 (No in step S313), the process proceeds to step S315.
- the predetermined threshold T1 is a threshold for suppressing (cutting off) an excessive increase in the number of ruled frame integration candidates C[j].
- the predetermined threshold T1 can be preset to 0.5, but is not limited to this value.
- in step S314, the concatenated string obtained by concatenating the strings in the ruled frames from the jth column to the (j+k)th column is stored as a ruled frame integration candidate C[j], together with the row number and column number of each ruled frame, for example in a memory MEM (not shown).
- in step S315, 1 is added to the variable j, and the process returns to step S304.
- in step S316, it is determined whether there is a character recognition error in the character string to be concatenated; if so, the character string may be replaced with the matching character string with the highest degree of match from among the item values registered in the knowledge database 4.
- in step S317, among the integration candidates C[j] whose ruled frames overlap, candidates whose likelihood of belonging to the item name is less than the predetermined threshold T1 are discarded from the memory MEM.
- the criterion is not limited to the predetermined threshold T1. For example, a predetermined number of candidates may be kept in order of decreasing likelihood.
- in step S318, 1 is added to the variable i, and the process returns to step S302. Note that it is not necessary to perform the series of processes from step S302 to step S318 on all ruled frames in the table-format document. For example, if it becomes clear during the process that all subsequent ruled frames will be blank, or if the user determines that merging of ruled frames is not necessary, the series of processes described above may be discontinued.
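The inner loop for one row (steps S303 to S315) can be sketched as below, using 0-based column indices. Several points are assumptions rather than statements from the text: strings are concatenated without a separator, wildcards are not handled, the likelihood is recomputed at step S313 instead of reusing the value from step S309, and overlapping candidates are resolved by simply preferring the higher likelihood. All function names are hypothetical.

```python
def edit_distance(a, b):
    """Minimum number of one-character insertions, deletions, or replacements."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def likelihood(text, matching_strings):
    """1 minus the smallest standardized edit distance to any matching string."""
    return 1.0 - min(
        edit_distance(text, m) / max(len(text), len(m), 1) for m in matching_strings
    )


def integrate_row(cells, knowledge, t1=0.5):
    """cells[j]: recognized string of column j ('' if none).
    knowledge[j]: non-empty list of matching strings for column j.
    Returns (first column, last column, integrated string, likelihood) tuples."""
    n = len(cells)
    candidates = []
    for j in range(n):                                   # S303, S304, S315
        if not cells[j]:                                 # S305
            continue
        k = 0                                            # S306
        while j + k + 1 <= n - 1 and cells[j + k]:       # S307, S308
            l1 = likelihood("".join(cells[j:j + k + 1]), knowledge[j])  # S309
            l2 = likelihood("".join(cells[j:j + k + 2]), knowledge[j])  # S310
            if l1 > l2:                                  # S311: No
                break
            k += 1                                       # S312
        text = "".join(cells[j:j + k + 1])
        score = likelihood(text, knowledge[j])           # recomputed for S313
        if score >= t1:                                  # S313
            candidates.append((j, j + k, text, score))   # S314
    # Overlap handling (assumption): keep higher-likelihood candidates first.
    kept = []
    for cand in sorted(candidates, key=lambda c: -c[3]):
        if all(cand[1] < other[0] or cand[0] > other[1] for other in kept):
            kept.append(cand)
    return sorted(kept)


cells = ["Total F", "at", "25g"]
knowledge = [["Total Fat", "Cholesterol"], ["Saturated Fat", "Trans Fat"], ["25g", "9g"]]
result = integrate_row(cells, knowledge)
```

In this toy row, "Total F" and "at" are merged into "Total Fat", the lone "at" falls below the threshold T1 and is dropped, and "25g" is kept as a single string.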
- FIG. 7 is a diagram for explaining an example of the operation of the table recognition device of the first embodiment.
- FIG. 7(a) is an example of a tabular document to be recognized. In the tabular document shown in FIG. 7(a), "Total Fat" is written as an item value belonging to the item name "Item A", "Saturated Fat" is written as an item value belonging to the item name "Item B", and "25g" and "9g" are written as item values belonging to the item name "Item C".
- FIG. 7(b) is an example of a table structure recognition result for FIG. 7(a).
- FIG. 7(c) is an example of a character recognition result for FIG. 7(b).
- FIG. 7(d) is an example of a character string recognition result obtained by integrating the ruled frame of FIG. 7(c).
- it is assumed that the table positions of the item names "Item A", "Item B", and "Item C" are known, and that the character recognition of each item name has been performed correctly.
- the table structure recognition unit 1 recognizes ruled frame 501 to ruled frame 509.
- character recognition unit 2 recognizes character strings within each area from ruled frame 501 to ruled frame 509. Then, character strings 510 to 517 are obtained as the character recognition results.
- the item value "Total Fat" belonging to the item name "Item A" exceeds the ruled frame in the second row and first column, and protrudes into the adjacent ruled frame in the second row and second column. Therefore, the character recognition result at this point is divided into a character string 513 ("Total F") and a character string 514 ("at"). That is, the character string 513 ("Total F") is recognized as the character string in the ruled frame 504 belonging to the item name "Item A", and the character string 514 ("at") is erroneously recognized as the character string in the ruled frame 505 belonging to the item name "Item B".
- the item value "Saturated Fat" belonging to the item name "Item B" is within the ruled frame 508, but the initial "S" touches the vertical ruled line. A character recognition error therefore occurs due to the influence of the vertical ruled line, and the "S" is read as "6". As a result, the character recognition result at this point is the erroneous character string 516 ("6aturated Fat"). Furthermore, because the item value "9g" belonging to the item name "Item C" is entered left-justified, it is also necessary to determine whether the item values "6aturated Fat" and "9g" form a continuous character string.
- the ruled line frame integration determination unit 3 refers to the matching strings (i.e., strings that should be written as item values) registered in the knowledge database 4, and determines which ruled line frames should be integrated based on the likelihood that the item value in each ruled line frame (i.e., the string obtained by character recognition) belongs to the item name (i.e., 1 minus the minimum standardized edit distance). In other words, it determines whether the string in each ruled line frame should be concatenated with the strings in the adjacent left and right ruled line frames.
- it then outputs either the concatenated string or the single, unconcatenated string as the integrated string, which is both the final integration determination result and the recognition result.
- variable L1 is the likelihood between the character string 513 "Total F" in the ruled box 504 and the item value "Total Fat" belonging to the item name "Item A" in the knowledge database 4, and can be calculated from the standardized edit distance NED between the character string 513 "Total F" and the matching character string "Total Fat".
- variable L2 is the likelihood between the concatenated character string "Total Fat" obtained by concatenating ruled box 504 and ruled box 505 and the item value "Total Fat" belonging to the item name "Item A" in knowledge database 4, and can be calculated from the standardized edit distance NED between the concatenated character string "Total Fat" and the matching character string "Total Fat".
- variable L1 and variable L2 are compared (step S311).
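As a worked example of this comparison, assume that "Total Fat" is the closest matching string in both cases and that the strings are concatenated without a separator. ED denotes the edit distance, a symbol introduced here for brevity.

```latex
\[
L_1 = 1 - \frac{\mathrm{ED}(\text{``Total F''},\ \text{``Total Fat''})}{\max(7,\, 9)}
    = 1 - \frac{2}{9} \approx 0.78,
\qquad
L_2 = 1 - \frac{\mathrm{ED}(\text{``Total Fat''},\ \text{``Total Fat''})}{\max(9,\, 9)}
    = 1 - \frac{0}{9} = 1.0
\]
```

Since L1 is less than or equal to L2, step S311 takes the Yes branch, k is incremented in step S312, and the next adjacent ruled frame is examined.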
- variable L1 is the likelihood between the concatenated character string "Total Fat" obtained by concatenating the ruled box 504 and the ruled box 505, and the item value "Total Fat" belonging to the item name "Item A" in the knowledge database 4, and can be calculated from the standardized edit distance NED.
- variable L2 is the likelihood between "Total Fat 25g", which is the concatenated character string obtained by concatenating ruled box 504, ruled box 505, and ruled box 506, and the item value "Total Fat" belonging to the item name "Item A" in knowledge database 4, and can be calculated from the standardized edit distance NED.
- variable L1 and variable L2 are compared (step S311).
- variable L1 is the likelihood between the character string 514 "at" in the ruled frame 505 and the item value "Trans Fat" belonging to the item name "Item B" in the knowledge database 4, and can be calculated from the standardized edit distance NED.
- variable L2 is the likelihood between the concatenated character string "at 25g" obtained by concatenating the character string in the ruled frame 505 and the character string in the ruled frame 506, and the item value "Trans Fat" belonging to the item name "Item B" in the knowledge database 4, and can be calculated from the standardized edit distance NED.
- the variables L1 and L2 are compared (step S311).
- In step S313, a variable L1 is calculated.
- the variable L1 is the likelihood between the character string 515 "25g" in the ruled box 506 and the item value "[*NUM*]g" belonging to the item name "Item C" in the knowledge database 4, and can be calculated from the standardized edit distance NED.
- In step S316, it is determined whether there is any overlap in the ruled line frames to be merged. Since there is no overlap in the ruled line frames to be merged (No in step S316), the character strings that were recognized separately as character string 513 "Total F" and character string 514 "at" are concatenated into the single concatenated character string "Total Fat".
- In step S316, the likelihood is used to determine whether there is a character recognition error in the concatenated string or the single string. Since the likelihood of the concatenated string formed by concatenating strings 513 "Total F" and 514 "at" is 1.0 (i.e., a perfect match with the string that should be written as the item value), the concatenated string is determined to have been recognized correctly, and string 521 ("Total Fat") is output as the integrated string, which is both the final integration judgment result and the recognition result. Furthermore, string 515 is treated as a single string that is not concatenated. Since its likelihood is also 1.0, string 515 is determined to have been recognized correctly, as are strings 513 and 514, and string 522 ("25g") is output as the integrated string, which is both the final integration judgment result and the recognition result.
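The comparison of the variables L1 and L2 in this first row can be sketched in Python. This is only an illustration under stated assumptions, not the implementation of the disclosure: the likelihood is assumed to be 1 minus the Levenshtein distance divided by the length of the longer string (a normalization that reproduces the value 0.923 given for "6aturated Fat"), merging is assumed to be preferred only when L2 exceeds L1, and the names `edit_distance` and `likelihood` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit cost for insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if characters match)
            ))
        prev = curr
    return prev[-1]


def likelihood(recognized: str, matching: str) -> float:
    """Assumed likelihood: 1 - NED, normalized by the longer string length."""
    longest = max(len(recognized), len(matching))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(recognized, matching) / longest


# Ruled box 504 alone versus ruled boxes 504 and 505 merged.
L1 = likelihood("Total F", "Total Fat")         # single string
L2 = likelihood("Total F" + "at", "Total Fat")  # concatenated string
merge = L2 > L1                                 # assumed decision rule
```

Under these assumptions the concatenated string scores 1.0 and the single string scores lower, so the two ruled boxes would be merged.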
- ruled box 508, i.e., character string 516 ("6aturated Fat")
- ruled box 509, i.e., character string 517 ("9g")
- matching character strings registered in knowledge database 4 are evaluated.
- variable L1 is the likelihood between the character string 516 "6aturated Fat" in the ruled box 508 and the item value "Saturated Fat" belonging to the item name "Item B" in the knowledge database 4, and can be calculated from the standardized edit distance NED.
- variable L2 is the likelihood between "6aturated Fat9g", which is a concatenated character string obtained by concatenating the character string in ruled box 508 and the character string in ruled box 509, and the item value "Saturated Fat” belonging to the item name "Item B” in knowledge database 4, and can be calculated from the standardized edit distance NED.
- variables L1 and L2 are compared (step S311).
- In step S313, a variable L1 is calculated.
- the variable L1 is the likelihood between the character string 517 "9g" in the ruled box 509 and the item value "[*NUM*]g" belonging to the item name "Item C" in the knowledge database 4, and can be calculated from the standardized edit distance NED.
- In step S316, it is determined whether there is any overlap in the ruled line frames to be merged. Since there is no overlap in the ruled line frames to be merged (No in step S316), character string 516 in ruled line frame 508 and character string 517 in ruled line frame 509 are each treated as a single, unconcatenated character string.
- In step S313, the likelihood is used to determine whether or not there is a character recognition error in the concatenated character string or the single character string.
- the likelihood of the character string 516 is 0.923, that is, it is not an exact match with the character string that should be written in the item value, so the value of the likelihood is compared with a predetermined threshold T2 for error determination.
- the predetermined threshold T2 for error determination is preferably, for example, 0.7, which can be set in advance. Since the value of the likelihood of the character string 516 (0.923) is equal to or greater than the predetermined threshold T2 (0.7) for error determination, it is presumed that there is an error in this character string.
- the character string 523 is replaced with the character string "Saturated Fat", which has the highest likelihood among the matching character strings registered in the knowledge database 4. Then, the character string 523 is output as an integrated character string, which is both the final integrated judgment result and the recognition result.
- the likelihood of character string 517 is 1.0
- character string 517 is determined to have been correctly recognized
- character string 524 (“9g") is output as an integrated character string which is both the final integrated judgment result and the recognition result.
- the concatenated string or the single string may be replaced with the string with the highest likelihood among the matching strings registered in the knowledge database 4. This is because the replacement will result in the same string.
- the likelihood is equal to or greater than a predetermined threshold T2 for error determination, including the case where the likelihood is 1.0
- the string may be replaced with the string with the highest likelihood among the matching strings registered in the knowledge database 4.
- the likelihood is equal to or greater than a predetermined threshold T2 for error determination, the string may be replaced with the matching string registered in the knowledge database 4 that was used to calculate the likelihood.
- the likelihood of the concatenated string is less than a predetermined threshold T2 (e.g., 0.7) for determining an error, it may be that, for example, the string itself has been correctly recognized, but the degree of match with the matching string registered in the knowledge database 4 is low. In that case, the concatenated string or the single string may be output as is, without being replaced with the matching string registered in the knowledge database 4.
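The threshold-based handling described above can be sketched as follows. This is a hedged illustration rather than the method of the disclosure: the helper `integrate` and its `scores` argument are hypothetical, the likelihoods are assumed to have been computed beforehand, and the score given for "Trans Fat" in the usage note is an arbitrary illustrative value.

```python
from typing import Dict

T2 = 0.7  # example error-determination threshold given in the text


def integrate(recognized: str, scores: Dict[str, float], t2: float = T2) -> str:
    """Return the integrated string for one ruled frame (hypothetical helper).

    `scores` maps each matching string registered in the knowledge database
    to its likelihood against `recognized`.
    """
    if not scores:
        return recognized  # nothing to compare against
    best_match, best_score = max(scores.items(), key=lambda kv: kv[1])
    if best_score >= t2:
        # Exact match (1.0) or presumed recognition error:
        # output the registered matching string.
        return best_match
    # Low likelihood: the string may be correct but simply not close to
    # any registered string, so it is output as is.
    return recognized
```

For example, `integrate("6aturated Fat", {"Saturated Fat": 0.923, "Trans Fat": 0.4})` returns "Saturated Fat", while a string whose best likelihood is below T2 is returned unchanged.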
- a predetermined threshold T2 e.g. 0.
- strings 518 to 524 are obtained as integrated strings obtained from the final integration decision of the ruled lines.
- the ruled line frame integration judgment unit 3 does not use the ruled line frame information of adjacent ruled line frames when judging the ruled line frame integration. Therefore, it is possible to correctly integrate ruled line frames without relying on the ruled line frame information.
- the ruled frame integration determination unit 3 also references the matching strings registered in the knowledge database 4, and determines that the concatenated string should be concatenated if it is not a meaningless string of characters and is likely to be meaningful (i.e., if it is judged to have a high likelihood and be close to the item name or item value). Therefore, it is possible to integrate ruled frames even if there is an error in part of the item name or item value (for example, a character recognition error, a typo, partial omission of the written content, etc.). Furthermore, it is possible to replace it with a string that is close to the matching string registered in the knowledge database 4, so the correct string can be output.
- the table recognition device described above in detail in embodiment 1 calculates, as a degree of match, the likelihood that the character recognition result within each ruled frame belongs to an item, and determines which character strings should be concatenated based on the calculated likelihood. Therefore, a character string written across a plurality of ruled frames can be accurately recognized without relying on ruled frame information.
- the table recognition device described in detail in the first embodiment refers to matching strings registered in a knowledge database, and determines that character strings should be concatenated if the concatenated string is highly likely to be meaningful. Furthermore, the concatenated string is replaced with a string that is close to the matching string registered in the knowledge database. Therefore, even if there is a character recognition error in the item name or in the item value, not only can character strings be concatenated accurately, but errors in the character recognition results can also be corrected at the same time, providing a synergistic effect.
- Embodiment 2. In Embodiment 1, a knowledge database is used for the ruled frame integration judgment, but the judgment is not limited to this.
- information on table structure constraints, which is information that restricts the matching character strings that can be written as an integrated character string, can also be used. This configuration will be described as the second embodiment.
- FIG. 8 is a functional configuration diagram showing the configuration of the table recognition device 100 in the second embodiment.
- the new component compared to FIG. 1 is the table structure knowledge database 5.
- the other components and operations are the same as those in FIG. 1, and the description will be omitted.
- the table structure knowledge database 5 stores information on table structure constraints, which is information that restricts compatible strings that can be written in a ruled frame as an integrated string.
- the information on table structure constraints is information that restricts compatible strings that can be written in a ruled frame as an integrated string from among multiple compatible strings defined in the knowledge database 4 based on string information in the surrounding ruled frames. More specifically, for example, when table items represent classifications such as major items, medium items, and minor items, the information on table structure constraints is information that indicates the relationship between these items.
- the table structure knowledge database 5 may register compatible strings that can be written as item values belonging to item names in a similar manner to the knowledge database 4.
- Figure 9 is an example of the table structure knowledge database 5.
- a compatible string belonging to the item name "Item A” is registered in the left column. Also, in the right column, when a matching string (i.e., a constrained string) for the item name "Item A” is written, a matching string (called a writable string) that can be written for the adjacent item name "Item B" is registered.
- a matching string i.e., a constrained string
- a matching string i.e., a writable string
- the string of a matching string is not limited to a single word, and may be multiple words, phrases, or sentences.
- the ruled line frame integration determination unit 3 refers to the knowledge database 4, the table structure knowledge database 5, and the ruled line frame integration candidates C[j] stored in a memory MEM (not shown), and limits the multiple matching character strings to one or more matching character strings using information on the constraints of the table structure. It then calculates the degree of match between the limited one or more matching character strings and the character string recognized by the character recognition unit 2, and determines which ruled line frames should be integrated depending on the degree of match.
- the ruled line frame integration judgment unit 3 refers to the table structure knowledge database 5, and when the ruled line frame integration candidate C[j] corresponds to an item value (constrained string) belonging to a specified item name, the unit 3 restricts the item value candidates (i.e., integrated strings) belonging to other item names adjacent to the specified item name in the knowledge database 4 to one or more matching strings by restricting them to describable strings.
- the item value candidates i.e., integrated strings
- FIG. 10 is a flow chart showing the operation sequence of the ruled line integration determination unit 3 in the second embodiment.
- the steps that differ from FIG. 6 are step S309A and step S310A. Steps that are given the same numbers as in FIG. 6 perform the same processing as shown in the first embodiment, and therefore their explanations are omitted.
- In step S309A, the ruled boxes from the jth column to the (j+k)th column are merged, and the character strings in the ruled boxes are concatenated to obtain a concatenated character string. Then, by referring to the table structure knowledge database 5 and the ruled box merging candidate C[j-1] stored in the memory MEM, it is determined whether the ruled box merging candidate C[j-1], which is a concatenated character string adjacent to the obtained concatenated character string, is a constrained character string for the obtained concatenated character string (step S309A).
- the ruled frame merging candidate C[j-1] corresponds to the constrained string
- the matching string belonging to the jth item name (i.e., the jth column) in the knowledge database 4 is restricted to the describable strings described in the table structure knowledge database 5. Then, the likelihood L[j+k,j] that the concatenated string belongs to the item name in the jth column is calculated, and the likelihood L[j+k,j] is substituted for the variable L1.
- In step S310A, the ruled boxes from the jth column to the (j+k+1)th column are merged, and the character strings in the ruled boxes are concatenated to obtain a concatenated character string. Then, similar to the processing in step S309A, the table structure knowledge database 5 and the ruled box merging candidate C[j-1] stored in the memory MEM are referenced to determine whether the ruled box merging candidate C[j-1], which is a concatenated character string adjacent to the obtained concatenated character string, is a constrained character string for the obtained concatenated character string (step S310A).
- the ruled frame merging candidate C[j-1] corresponds to the constrained string
- the matching string belonging to the jth item name (i.e., the jth column) in the knowledge database 4 is restricted to the describable strings described in the table structure knowledge database 5. Then, the likelihood L[j+k+1,j] that the concatenated string belongs to the jth item name is calculated, and the likelihood L[j+k+1,j] is substituted for the variable L2.
- In step S310A, if the ruled frame merging candidate C[j-1] does not correspond to the constrained string, no restriction is performed using the table structure knowledge database 5, and the likelihood L[j+k+1,j] that the concatenated string belongs to the jth item name is calculated, and the likelihood L[j+k+1,j] is substituted for the variable L2 (step S310A).
- the ruled line box integration judgment unit 3 refers to a table structure knowledge database 5 and imposes constraints on a knowledge database 4 used when calculating the likelihood that the character recognition results within each ruled line box belong to an item. Specifically, when character string 521 ("Total Fat") is obtained as an item value belonging to the item name "Item A" in FIG. 7(d), by referring to table structure knowledge database 5 in FIG. 9, the candidates for describable character strings for the item value belonging to the adjacent item name "Item B" are restricted to "Saturated Fat" or "Trans Fat.”
- the table recognition device described above in detail in embodiment 2 limits the matching character strings to one or more matching character strings in the ruled frame integration determination by using table structure constraint information, which is information that restricts the matching character strings that can be written as an integrated character string, thereby improving the accuracy of the ruled frame integration determination.
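The restriction of matching strings by table structure constraints can be sketched as follows. This is a minimal sketch under assumptions: the dictionary `TABLE_STRUCTURE_DB` is a hypothetical stand-in for the table structure knowledge database 5 modeled on the "Total Fat" example, the function `restrict_candidates` is a hypothetical name, and "Dietary Fiber" is an invented entry used only to show a candidate being filtered out.

```python
from typing import Dict, List, Optional

# Hypothetical contents: constrained string -> describable (writable) strings
# for the adjacent item name.
TABLE_STRUCTURE_DB: Dict[str, List[str]] = {
    "Total Fat": ["Saturated Fat", "Trans Fat"],
}


def restrict_candidates(
    candidates: List[str],
    previous: Optional[str],
    structure_db: Dict[str, List[str]],
) -> List[str]:
    """Limit the matching strings for column j using the result C[j-1]."""
    if previous is None or previous not in structure_db:
        # C[j-1] is not a constrained string: no restriction is applied.
        return list(candidates)
    writable = set(structure_db[previous])
    return [c for c in candidates if c in writable]


# Matching strings assumed to be registered for "Item B" (illustrative only).
item_b = ["Saturated Fat", "Trans Fat", "Dietary Fiber"]
limited = restrict_candidates(item_b, "Total Fat", TABLE_STRUCTURE_DB)
```

The likelihood would then be calculated only against the strings in `limited`, which narrows the comparison to the entries permitted by the adjacent item value.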
- Embodiment 3. In the ruled line frame integration determination, the determination can be made taking into consideration the possibility that characters adjacent to the ruled lines have been erroneously recognized. This configuration will be described as a third embodiment.
- When calculating the likelihood that a character string in a recognition result belongs to a certain item, the ruled line frame integration determination unit 3 weights each character to reduce the influence of erroneous recognition of characters close to ruled lines.
- the cost value for character conversion of characters close to a ruled line can be weighted lightly in cost calculation of character conversion such as insertion, deletion, and replacement when calculating the standardized edit distance.
- the weighting value of the cost value is preferably, for example, 0.5 compared to the usual 1, but is not limited to this.
- the weighting value of the cost value can be appropriately changed depending on the type of the ruled line, etc.
- a character is close to a ruled line
- the character is determined to be close to the ruled line.
- the distance from the ruled line to the character may be used to determine whether the character is close to the ruled line.
- the distance from the ruled line to the character is closer than a predetermined threshold, the character is determined to be close to the ruled line.
- the threshold value for the distance from the ruled line to the character can be set in advance to a value corresponding to, for example, the thickness of the ruled line or the size of the character.
- the threshold value for the distance from the ruled line to the character can be set to a distance three times the thickness of the ruled line.
- the number of characters determined to be close in a character string within one ruled line frame is not limited to one character. For example, in a three-character character string "ABC,” if the distance from the ruled line to the characters "B” and “C” is closer than a predetermined threshold, the characters "B" and “C” are determined to be close to the ruled line. In other words, both the characters "B” and “C” can be subject to weighting of the cost value.
- FIG. 11 is a specific example of the operation of the table recognition device of this embodiment 3.
- FIG. 11(a) is an example of a table to be recognized. The table shown in FIG. 11(a) has "Total Fat" as an item value belonging to the item name "Item A" and "25g" as an item value belonging to the item name "Item C”.
- FIG. 11(b) is an example of the table structure recognition result for FIG. 11(a).
- FIG. 11(c) is an example of the character recognition result for FIG. 11(b).
- FIG. 11(d) is an example of the character string recognition result obtained by integrating the ruled frame of FIG. 11(c).
- the table structure recognition unit 1 recognizes ruled frame 601 to ruled frame 606.
- the character recognition unit 2 recognizes character strings 607 to 612 within each ruled frame.
- In character string 610, the character "F" adjacent to the vertical double ruled line is mistakenly recognized as the character "P."
- In character string 611, the character "a" adjacent to the vertical double ruled line is mistakenly recognized as the character "p."
- the likelihood is 0.778 when the cost value is not weighted, whereas the likelihood is 0.889 when the cost value is weighted.
- the likelihood is higher than when the cost value is not weighted, and the possibility that a character string of another item name will be mistakenly adopted (as the correct character string) can be reduced. This can further improve the accuracy of the ruled frame integration judgment.
- the cost value is weighted lightly for characters close to vertical lines, but this is not limited to the above.
- the same processing can be performed on characters close to horizontal lines, and the same effect as described above can be achieved.
- the cost value for character conversion of characters close to a ruled line is weighted lightly. This reduces the influence of characters that are more likely to be erroneously recognized compared to other characters, thereby further improving the accuracy of the ruled line frame integration determination.
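The effect of weighting can be checked with a small weighted edit distance. The sketch below rests on assumptions not stated in the text: the recognized strings 610 and 611 are taken to be "Total P" and "pt", the likelihood is normalized by the length of the longer string, edits that touch a recognized character cost that character's weight, and insertions keep unit cost. The function names are hypothetical.

```python
from typing import Sequence


def weighted_edit_distance(recognized: str, matching: str,
                           weights: Sequence[float]) -> float:
    """Edit distance where deleting or substituting recognized[i] costs weights[i].

    Insertions of characters from `matching` keep unit cost (an assumption).
    """
    if len(weights) != len(recognized):
        raise ValueError("one weight per recognized character is required")
    prev = [float(j) for j in range(len(matching) + 1)]
    for i, ch in enumerate(recognized):
        w = weights[i]
        curr = [prev[0] + w]  # delete recognized[i]
        for j, target in enumerate(matching, start=1):
            substitute = prev[j - 1] + (0.0 if ch == target else w)
            delete = prev[j] + w
            insert = curr[j - 1] + 1.0
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


def weighted_likelihood(recognized: str, matching: str,
                        weights: Sequence[float]) -> float:
    longest = max(len(recognized), len(matching))
    if longest == 0:
        return 1.0
    return 1.0 - weighted_edit_distance(recognized, matching, weights) / longest


recognized = "Total P" + "pt"  # assumed strings 610 and 611 ("F"->"P", "a"->"p")
near_line = {6, 7}             # indices of the characters next to the double ruled line
plain = [1.0] * len(recognized)
light = [0.5 if i in near_line else 1.0 for i in range(len(recognized))]

unweighted = weighted_likelihood(recognized, "Total Fat", plain)
weighted = weighted_likelihood(recognized, "Total Fat", light)
```

With these assumptions, `unweighted` rounds to 0.778 and `weighted` rounds to 0.889, matching the likelihoods quoted for the unweighted and weighted cases.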
- likelihood has been shown as an example of the degree of similarity between two character strings, but this is not limiting.
- character strings may be represented as vectors, and the cosine similarity between two character string vectors may be used as the degree of similarity. For example, when the cosine similarity is close to 1, the two character string vectors are similar and the degree of similarity is high; on the other hand, when the cosine similarity is close to 0, the two character string vectors are not similar and the degree of similarity is low.
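One way to picture the cosine-similarity alternative is shown below. The text does not specify how strings are turned into vectors, so the character-bigram counting used here is purely an assumption chosen for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter


def bigram_vector(text: str) -> Counter:
    """Represent a string as counts of its character bigrams (one possible vectorization)."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the bigram count vectors of two strings."""
    va, vb = bigram_vector(a), bigram_vector(b)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

Identical strings give a similarity of 1, strings with no bigram in common give 0, and a string with a single misrecognized character such as "6aturated Fat" falls in between when compared with "Saturated Fat".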
- the process of determining whether or not to merge ruled lines is not limited to languages that are written horizontally or left-to-right.
- the table recognition device according to the above-described embodiments can also be applied to tables in which the rows and columns are swapped, such as vertically written documents.
- the table recognition device according to the above-described embodiments can also be applied to languages in which writing starts from the right, such as Arabic.
- any other configuration may be used as long as it provides similar functions and effects.
- any component of the embodiment may be modified or omitted.
- 1 Table structure recognition unit, 2 Character recognition unit, 3 Ruled line frame integration determination unit, 4 Knowledge database, 5 Table structure knowledge database, 100 Table recognition device, 101 Processor, 102 Memory, 103 External storage device, 104 Input/output interface.
Landscapes
- Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Artificial Intelligence (AREA)
- Character Input (AREA)
Abstract
The purpose of the present invention is to achieve accurate recognition of a character string written across a plurality of ruled line frames without relying on ruled line frame information. The present invention provides a table recognition device that recognizes a character string written in a tabular document from image information of the tabular document, the table recognition device comprising: a character recognition unit that recognizes character strings each written inside each of a plurality of ruled line frames provided in the tabular document; and a ruled line frame merge determination unit that identifies, from among an independent character string recognized for a target ruled line frame treated as a target from among the plurality of ruled line frames and a concatenated character string obtained by concatenating the independent character string with a character string recognized for a ruled line frame different from the target ruled line frame, the independent character string or the concatenated character string having a higher degree of agreement with a conforming character string to be written in the tabular document, as a unified character string that belongs to the target ruled line frame.
Description
The present disclosure relates to a table recognition device and method for performing character recognition from image information of a table-format document.
When recognizing characters in an image of a table-format document, the ruled lines of the table are extracted, and the rows and columns of the table are separated into multiple areas delimited by the ruled lines (multiple ruled frames). Character recognition is then performed for each ruled frame, and the character recognition results are held individually. Consequently, when a single item name or item value is written across multiple ruled frames, for example when a character string inside a ruled frame exceeds the frame and spills over into an adjacent ruled frame, the item name or item value (character string) may not be recognized.
Therefore, a technology has been disclosed that recognizes an item name or item value (character string) by merging ruled frames when the ruled frame information of adjacent ruled frames satisfies a preset condition, for example, when adjacent ruled frames are both drawn with solid lines of the same thickness (for example, Patent Document 1).
According to this conventional technique, it is possible to recognize a character string that is written across multiple ruled frames.
However, the conventional technology has the following problem. The ruled frame information of adjacent ruled frames may differ, for example when adjacent ruled frames are drawn with different types of ruled lines, and in such cases the preset condition is not satisfied. A character string written across multiple ruled frames that do not satisfy the condition therefore cannot be recognized.
The present disclosure has been made to solve the above-mentioned problems, and aims to make it possible to accurately recognize character strings that span multiple ruled frames without relying on ruled frame information.
The table recognition device of the present disclosure is a table recognition device that recognizes character strings written in a table-format document from image information of the table-format document, and comprises:
a character recognition unit that recognizes the character strings written within each of a plurality of ruled frames provided in the table-format document; and
a ruled line frame integration determination unit that determines, between a single character string, which is the character string recognized for a target ruled frame among the plurality of ruled frames, and a concatenated character string, which is obtained by concatenating the single character string with a character string recognized for a ruled frame other than the target ruled frame,
whichever has the higher degree of match with a matching character string to be written in the table-format document as an integrated character string that belongs to the target ruled frame.
The table recognition method of the present disclosure is a table recognition method for recognizing character strings written in a table-format document from image information of the table-format document, in which:
a character recognition unit recognizes the character strings written within each of a plurality of ruled frames provided in the table-format document; and
a ruled line frame integration determination unit determines, between a single character string, which is the character string recognized for a target ruled frame among the plurality of ruled frames, and a concatenated character string, which is obtained by concatenating the single character string with a character string recognized for a ruled frame other than the target ruled frame,
whichever has the higher degree of match with a matching character string to be written in the table-format document as an integrated character string that belongs to the target ruled frame.
According to the present disclosure, it is possible to accurately recognize a character string that is written across multiple ruled frames without relying on ruled frame information.
In the description of the embodiments and the drawings, the same elements and corresponding elements are given the same reference numerals. Descriptions of elements given the same reference numerals are omitted or simplified as appropriate. In the following embodiments, "unit" may be read as "circuit," "step," "procedure," or "process" as appropriate.
Embodiment 1.
<Configuration>
The table recognition device in Embodiment 1 will be described with reference to Figs. 1 to 7. Fig. 1 is a functional block diagram showing the configuration of the table recognition device 100 in Embodiment 1. In Fig. 1, the table recognition device 100 comprises a table structure recognition unit 1, a character recognition unit 2, a ruled line frame integration determination unit 3, and a knowledge database 4.
Fig. 2 is a diagram showing an example of a table-format document to be recognized by the table recognition device 100 in Embodiment 1. In the first row of the table shown in Fig. 2, the item name "Item A" is written in the first column, "Item B" in the second column, and "Item C" in the third column. The second and subsequent rows contain the item values belonging to each item name. As shown in Fig. 2, in a table-format document, characters may protrude from a ruled line frame or touch a ruled line frame because of printing effects or an excessive number of characters. Specifically, in the table shown in Fig. 2, the item value "Total Fat" belonging to the item name "Item A" has its characters "at" protruding into the ruled line frame of the item name "Item B". In addition, the item value "Saturated Fat" belonging to the item name "Item B" has its initial character "S" touching the ruled line frame.
The table structure recognition unit 1 extracts ruled lines from the image information of a table-format document and recognizes the table structure. The table structure consists of multiple regions (i.e., multiple ruled line frames) formed by dividing the rows and columns of the table with ruled lines.
The table structure can be recognized using, for example, a method based on edge histograms. Specifically, from the image information of the table-format document, edges in two directions are obtained near the boundaries of the white pixel clusters inside the table region. Here, a white pixel cluster is a white-background region enclosed by a ruled line frame of a color other than white. Then, partial ruled line information is obtained based on the edge histograms computed from each of the two edge directions. Finally, the ruled line information of the table region is obtained from this partial ruled line information, and the table structure is recognized. The method of recognizing the table structure is not limited to this one. Any method can be used as long as it yields information on the row and column structure of the table.
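The following is a minimal sketch of one of the "other methods" the passage allows, not the edge-histogram method itself. It uses a plain projection profile on a binarized image to locate ruled lines. The function names, the `min_fill` parameter, and the assumption of one-pixel-thick, unbroken ruled lines are all illustrative choices that do not come from the description.

```python
from typing import List, Tuple


def find_ruled_lines(image: List[List[int]], min_fill: float = 0.9) -> Tuple[List[int], List[int]]:
    """Locate ruled lines in a binary image (1 = dark pixel, 0 = white pixel).

    A row (or column) is treated as a ruled line when at least `min_fill`
    of its pixels are dark. This is an assumed simplification.
    """
    height, width = len(image), len(image[0])
    row_hist = [sum(row) for row in image]
    col_hist = [sum(image[y][x] for y in range(height)) for x in range(width)]
    rows = [y for y, count in enumerate(row_hist) if count >= min_fill * width]
    cols = [x for x, count in enumerate(col_hist) if count >= min_fill * height]
    return rows, cols


def frame_grid(rows: List[int], cols: List[int]) -> Tuple[int, int]:
    """Number of table rows and columns, assuming one-pixel-thick ruled lines."""
    return len(rows) - 1, len(cols) - 1


# Tiny synthetic table: an outer border plus one vertical ruled line at x = 3.
image = [[1] * 7] + [[1, 0, 0, 1, 0, 0, 1] for _ in range(3)] + [[1] * 7]
rows, cols = find_ruled_lines(image)
print(rows, cols, frame_grid(rows, cols))
```

A real scanned document would need binarization, skew correction, and tolerance for broken or thick lines, which this sketch leaves out.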
The character recognition unit 2 recognizes the character strings within the ruled line frames using, for example, optical character recognition (OCR) technology. The method for recognizing the character strings within the ruled line frames is not limited to OCR, and other methods may be used.
The ruled line frame integration determination unit 3 determines which ruled line frames should be integrated according to the degree of match between the character string recognized by the character recognition unit 2 and the matching character strings registered in the knowledge database 4. In other words, it determines whether the character string in each ruled line frame should be concatenated with the character strings in the adjacent ruled line frames to its left and right. For example, a likelihood can be used as the degree of match. The likelihood is a value expressing how plausible it is that an arbitrary character string belongs to a certain group of character strings. For example, the likelihood can be obtained from the normalized edit distance between two character strings. Here, the normalized edit distance is the edit distance divided by the length of the longer character string, and the edit distance is the minimum number of single-character insertions, deletions, and substitutions required to transform one character string into the other. The likelihood may also be read as the degree of match between two character strings.
Then, based on the determination result, the character strings in the ruled line frames are concatenated, and either the concatenated character string or the single character string that is left unconcatenated is output as the integrated character string, which is the final integration determination result.
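As a worked illustration of the normalized edit distance defined above, the following writes out the formula and one small example. The symbols ED and NED(s, t) are notation introduced here, and the fragment "Total F" is an assumed recognition result for a frame whose last two characters spilled into the neighboring frame. It is not taken from the figures. The `\text` command assumes the amsmath package.

```latex
\[
  \mathrm{NED}(s, t) = \frac{\mathrm{ED}(s, t)}{\max(|s|, |t|)}
\]
% Assumed example: s = "Total F" (7 characters), t = "Total Fat" (9 characters).
% Two insertions ("a" and "t") turn s into t, so ED(s, t) = 2.
\[
  \mathrm{NED}(\text{``Total F''}, \text{``Total Fat''}) = \frac{2}{9} \approx 0.22
\]
```

A value near 0 means the two strings are nearly identical, and a value near 1 means they share almost nothing.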
The knowledge database 4 defines the item names and the matching character strings, which are the character strings that should be written as item values belonging to each item name. Fig. 3 shows an example of the matching character strings in the knowledge database 4. One or more matching character strings are registered in the knowledge database 4 for each item name. In other words, for each item name of the table-format document, any of the matching character strings defined in the knowledge database 4 for that item may appear in the ruled line frame of an item value. A matching character string is not limited to a single word and may be a multi-word phrase or a sentence. IDX in Fig. 3 is an index number assigned individually to each matching character string. It is used to specify a matching character string when referring to the knowledge database 4. Specifically, in Fig. 3, the item values belonging to the item name "Item A", for example, are assigned the index numbers A1, A2, ..., A9 in order. The matching character strings are not limited to those shown in Fig. 3 and can be set arbitrarily.
A so-called wildcard, which stands for any character or number, can also be set in the item values of the knowledge database 4. Specifically, in the matching character strings belonging to the item name "Item C" in Fig. 3, [*NUM*] denotes a wildcard that can stand for any number. Setting a wildcard in an item value removes the need to register a separate matching character string for each character or number, which reduces the storage capacity required for the knowledge database 4. It also reduces the amount of calculation required to look up a matching character string for each character or number.
Furthermore, if a character string recognized during table recognition contains an error and satisfies a predetermined condition, it may be replaced with the character string of an item name or item value registered in the knowledge database 4. Alternatively, a character string that is not registered as an item name or item value may be excluded from the recognition targets as a misrecognition. The specific conditions for replacement are described later.
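The knowledge database described above might be sketched as a nested mapping keyed by item name and IDX. Only the item names "Item A" to "Item C", the IDX scheme, and the `[*NUM*]` wildcard come from the description. The individual entries, the helper names, and the way the wildcard is applied (masking numeric runs before comparison) are assumptions made for illustration.

```python
import re
from typing import Dict

# Hypothetical contents: the entries below are assumed, not copied from Fig. 3.
KNOWLEDGE_DB: Dict[str, Dict[str, str]] = {
    "Item A": {"A1": "Total Fat", "A2": "Protein"},
    "Item B": {"B1": "Saturated Fat", "B2": "Trans Fat"},
    "Item C": {"C1": "[*NUM*]g", "C2": "[*NUM*]%"},
}

NUM_WILDCARD = "[*NUM*]"
NUMBER = re.compile(r"\d+(?:\.\d+)?")  # integer or decimal number


def mask_numbers(text: str) -> str:
    """Replace each numeric run with the wildcard token (assumed matching rule)."""
    return NUMBER.sub(lambda _match: NUM_WILDCARD, text)


def lookup(item_name: str, idx: str) -> str:
    """Read one matching character string using IDX as the search key."""
    return KNOWLEDGE_DB[item_name][idx]


def matches_entry(item_name: str, recognized: str) -> bool:
    """True if the recognized string equals a registered entry after masking numbers."""
    return mask_numbers(recognized) in KNOWLEDGE_DB[item_name].values()


print(lookup("Item A", "A1"))          # Total Fat
print(mask_numbers("12.5g"))           # [*NUM*]g
print(matches_entry("Item C", "5g"))   # True
```

With this layout, one wildcard entry such as `[*NUM*]g` covers every numeric value, which is the storage saving the text describes.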
<Hardware>
Next, the hardware of the table recognition device 100 in Embodiment 1 will be described. Fig. 4 is a hardware configuration diagram of the table recognition device 100. The table recognition device 100 has a processor 101, a memory 102, an external storage device 103, and an input/output interface 104.
The processor 101 controls the entire table recognition device 100. For example, the processor 101 is a CPU (Central Processing Unit) or an FPGA (Field Programmable Gate Array). The processor 101 may be a multiprocessor. The table recognition device 100 may also have a processing circuit.
The memory 102 is the main storage device of the table recognition device 100. For example, the memory 102 is a RAM (Random Access Memory). The external storage device 103 is the auxiliary storage device of the table recognition device 100. For example, the external storage device 103 is an HDD (Hard Disk Drive) or an SSD (Solid State Drive). The input/output interface 104 is an interface that exchanges data with external devices connected to the table recognition device 100. For example, the input/output interface 104 is a NIC (Network Interface Controller). The external devices are, for example, an image scanner and a display. The external devices are not shown in the figure.
The processor 101 reads the table recognition program stored in the external storage device 103 into the memory 102 and executes it, thereby carrying out each process of the table recognition method. The external storage device 103 holds the program and data for implementing the table recognition method of Embodiment 1. The table recognition program may be provided over a network or recorded on a computer-readable recording medium. That is, the table recognition program may be provided, for example, as a program product.
The input/output interface 104 receives the image information of a table-format document from an external device such as an image scanner, and outputs the table recognition result to an external device such as a display.
<Flowchart>
Next, the operation of the table recognition device 100 in Embodiment 1 will be described. Fig. 5 is a flowchart showing the operation sequence of the table recognition device 100 in Embodiment 1. For simplicity, the contents of the first row of the table (i.e., the row containing the item names) are assumed to be known, and only the method for integrating the ruled line frames of the item values in the second and subsequent rows is described. Because processing of the item-name row (the first row) is omitted, the second row, which is the first row of item values, is treated as the new first row in the following description.
First, in step S1, the table structure recognition unit 1 extracts ruled lines from the image information of the table-format document and recognizes the table structure consisting of multiple ruled line frames. The table structure recognition unit 1 then obtains, from the recognized table structure, the number of rows in the table and the number of columns in each row (step S1).
In step S2, the character recognition unit 2 recognizes the characters within the ruled line frames recognized by the table structure recognition unit 1 in step S1 (step S2).
Next, in step S3, the ruled line frame integration determination unit 3 refers to the knowledge database 4 and determines which ruled line frames to integrate based on the character recognition results obtained in step S2. Based on the determination result, it concatenates the character strings in adjacent ruled line frames and outputs either the concatenated character string or the single character string that is left unconcatenated (step S3).
Fig. 6 is a flowchart showing the operation sequence of the ruled line frame integration determination unit 3 in step S3. The symbol "←" in the figure denotes assigning the value or element on the right-hand side to the variable on the left-hand side. To simplify the explanation, the value held by a variable is sometimes referred to by the variable's name. The operation is described for a table whose origin is at the upper left.
First, in step S301, 1 is assigned to the variable i, which represents the row position in the table (step S301).
In step S302, it is checked whether the variable i is less than or equal to the number of rows in the table. If it is (Yes in step S302), the process proceeds to step S303. If the variable i exceeds the number of rows in the table (No in step S302), all rows have been evaluated and the process ends (END).
In step S303, 1 is assigned to the variable j, which represents the column position in the table (step S303).
In step S304, it is checked whether the variable j is less than or equal to the total number of item-value ruled line frames belonging to row i (hereinafter, the number of items) (step S304). If it is (Yes in step S304), the process proceeds to step S305. If the variable j exceeds the number of items (No in step S304), the process proceeds to step S316.
In step S305, it is checked whether there is a character recognition result in the ruled line frame in the j-th column of row i (step S305). If there is (Yes in step S305), the process proceeds to step S306. If there is not (No in step S305), the process proceeds to step S315.
In step S306, 0 is assigned to the variable k (step S306). The variable k represents the number of other ruled line frames (columns) to be integrated with the ruled line frame in the j-th column of row i. Specifically, when k=0, the ruled line frame in the j-th column of row i is not integrated and is treated as a single ruled line frame. When k=1, one adjacent ruled line frame is integrated with it.
In step S307, it is checked whether the value of the variable k is less than or equal to the number of items minus the value of the variable j minus 1 (step S307). If it is (Yes in step S307), the process proceeds to step S308. If it is greater (No in step S307), the process proceeds to step S313.
In step S308, it is checked whether there is a character recognition result in the ruled line frame in the (j+k)-th column of row i (step S308). If there is (Yes in step S308), the process proceeds to step S309. If there is not (No in step S308), the process proceeds to step S313.
In step S309, the ruled line frames from the j-th column to the (j+k)-th column are integrated, and the character strings in those frames are concatenated to obtain a concatenated character string. Then, the knowledge database 4 is referenced, the matching character strings belonging to the j-th item name (i.e., that of the j-th column) are read out in sequence, the likelihood L[j+k, j] that the concatenated character string belongs to the item name of the j-th column is calculated, and L[j+k, j] is assigned to the variable L1 (step S309). As a concrete example of how the knowledge database 4 is referenced in this embodiment, when j=1 (the first column of the table), the matching character strings belonging to the item name "Item A" are read out in sequence, and when j=2 (the second column), those belonging to the item name "Item B" are read out in sequence. The index number IDX can be used as a search key for reading matching character strings out of the knowledge database 4. Specifically, the likelihood L[j+k, j] is calculated, for example, as follows. First, from the one or more matching character strings registered in the knowledge database 4 as item values belonging to the j-th item name, one matching character string is read out using the index number IDX as a key. Next, the normalized edit distance NED is calculated between the concatenated character string, obtained by concatenating the character strings in the ruled line frames from the j-th column to the (j+k)-th column, and the matching character string that was read out. The NED is calculated for each matching character string registered in the knowledge database 4. It may be calculated for all registered matching character strings or for only some of them.
Next, the minimum value NED_MIN is found among the one or more calculated normalized edit distances NED. The likelihood L[j+k, j] can then be obtained by subtracting NED_MIN from 1. In other words, the likelihood expresses the degree of match between the concatenated character string and the most similar matching character string registered in the knowledge database 4. When the degree of match between two character strings is high, fewer edit operations are needed, so the normalized edit distance is small. Therefore, when the concatenated character string obtained from the ruled line frames in the j-th to (j+k)-th columns closely matches a matching character string registered in the knowledge database 4 as an item value of the j-th item name, the likelihood L[j+k, j] takes a high value (close to 1), and when the match is poor, it takes a low value (close to 0).
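The likelihood calculation described for step S309 can be sketched as follows. This is a minimal illustration that assumes a plain Levenshtein edit distance and takes the matching character strings as an ordinary list. The function names and the sample strings are introduced here and do not come from the description.

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer string."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def likelihood(candidate: str, matching_strings: Iterable[str]) -> float:
    """1 - NED_MIN, where NED_MIN is the smallest NED over the matching strings."""
    ned_min = min(
        (normalized_edit_distance(candidate, m) for m in matching_strings),
        default=1.0,  # assumed convention: no matching strings gives likelihood 0
    )
    return 1.0 - ned_min


item_a = ["Total Fat", "Protein"]        # assumed entries for "Item A"
print(likelihood("Total Fat", item_a))   # 1.0
print(likelihood("Total F", item_a))     # about 0.78 (1 - 2/9)
```

The concatenated string "Total Fat" scores higher than the lone fragment "Total F", which is the comparison that drives the integration decision.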
The likelihood may also be calculated from the output of a trained model that has been trained on the knowledge database 4 using a known machine learning method such as a DNN (Deep Neural Network). The trained model can be created from character string data obtained from a large number of table-format documents. Specifically, multiple character strings are extracted at random from the large body of character string data and concatenated to generate concatenated character strings. Next, the likelihood (for example, based on the normalized edit distance) is calculated between each generated concatenated character string, which serves as input character string data, and the matching character strings registered in the knowledge database 4. The likelihood corresponding to each piece of input character string data is then assigned as its correct label (or ranking) to form the training data. Finally, the trained model is created by machine learning on the input character string data and the training data so that the estimated likelihood output by the model matches the correct label. Because inputting a concatenated character string into the trained model yields an estimated likelihood, the likelihood can be obtained directly without referring to, or even without using, the knowledge database 4. This is particularly effective when a large number of matching character strings are registered in the knowledge database 4, since it reduces both the amount of calculation needed to reference the knowledge database 4 and the amount of memory needed to store its matching character strings.
In step S310, in order to compare the concatenated character string obtained in step S309 with the one obtained by concatenating one more character string from the adjacent ruled line frame, the ruled line frames from the j-th column to the (j+k+1)-th column are integrated and the character strings in those frames are concatenated to obtain a concatenated character string. Then, as in step S309, the knowledge database 4 is referenced, the likelihood L[j+k+1, j] that this concatenated character string belongs to the j-th item name is calculated, and L[j+k+1, j] is assigned to the variable L2 (step S310).
In step S311, it is checked whether the value of the variable L1 is less than or equal to the value of the variable L2 (step S311). If it is (Yes in step S311), concatenating one more character string gives a likelihood at least as high as that of the concatenated character string obtained in step S309, so the process proceeds to step S312. If the value of L1 is greater than the value of L2 (No in step S311), the process proceeds to step S313.
In step S312, 1 is added to the variable k (step S312), and the process returns to step S307.
In step S313, it is checked whether the likelihood L[j+k, j] calculated in step S309 is greater than or equal to a predetermined threshold T1 (step S313). If it is (Yes in step S313), the process proceeds to step S314. If the likelihood L[j+k, j] is less than the predetermined threshold T1 (No in step S313), the process proceeds to step S315. Here, the predetermined threshold T1 is a cutoff that keeps the number of ruled line frame integration candidates C[j] from growing too large. For example, T1 can be preset to 0.5, but it is not limited to this value.
In step S314, the concatenated character string obtained by concatenating the character strings in the ruled line frames from the j-th column to the (j+k)-th column, together with the row number and column number of each of those ruled line frames, is stored as the ruled line frame integration candidate C[j], for example in a memory MEM (not shown) (step S314).
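Steps S303 to S315 for a single row can be sketched as the loop below. This is one possible reading of the flowchart under stated assumptions: columns are 0-based, strings are concatenated without a separator, and the likelihood is supplied as a callable `score(text, j)`. The toy `exact_score` function and the sample row are stand-ins introduced here. In practice the scorer would be 1 - NED_MIN or a trained model.

```python
from typing import Callable, Dict, List, Tuple

# score(text, j) is assumed to return the likelihood that `text` belongs to
# the item name of column j (for example 1 - NED_MIN).
Score = Callable[[str, int], float]


def integration_candidates(row: List[str], score: Score,
                           t1: float = 0.5) -> Dict[int, Tuple[str, int, int]]:
    """Return {j: (text, first_col, last_col)} for one row (0-based columns)."""
    n = len(row)
    candidates: Dict[int, Tuple[str, int, int]] = {}
    for j in range(n):                                 # S303, S304, S315
        if not row[j]:                                 # S305: nothing recognized
            continue
        k = 0                                          # S306
        while j + k + 1 < n and row[j + k]:            # S307, S308
            l1 = score("".join(row[j:j + k + 1]), j)   # S309
            l2 = score("".join(row[j:j + k + 2]), j)   # S310
            if l1 > l2:                                # S311: No
                break
            k += 1                                     # S312
        text = "".join(row[j:j + k + 1])
        if score(text, j) >= t1:                       # S313
            candidates[j] = (text, j, j + k)           # S314
    return candidates


# Toy stand-in scorer (assumed): 1.0 for an exact match, otherwise 0.0.
MATCHING = [{"Total Fat"}, {"Saturated Fat"}, {"5g"}]


def exact_score(text: str, j: int) -> float:
    return 1.0 if text in MATCHING[j] else 0.0


print(integration_candidates(["Total F", "at", "5g"], exact_score))
# {0: ('Total Fat', 0, 1), 2: ('5g', 2, 2)}
```

Passing the scorer in as a parameter keeps the loop independent of how the likelihood is obtained. Note that a tie (L1 equal to L2) extends the span, following the "less than or equal" test in step S311.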
In step S315, 1 is added to the variable j (step S315), and the process returns to step S304.
In step S316, the memory MEM is referenced to check whether the ruled line frame integration candidates C[j] overlap in the ruled line frames to be integrated. For example, if the ruled line frames of the first and second columns become an integration candidate C[j] at j=1, and the ruled line frames of the second and third columns become an integration candidate C[j] at j=2, the ruled line frame of the second column belongs to both candidates, so it is determined that the ruled line frames to be integrated overlap. If there is an overlap (Yes in step S316), the process proceeds to step S317. If there is no overlap (No in step S316), the process proceeds to step S318. Note that in step S316 it may also be determined whether the character string to be concatenated contains a character recognition error; if it does, that character string may be replaced with the matching character string that has the highest degree of match among the item values registered in the table knowledge database 4.
In step S317, among the ruled line frame integration candidates C[j] whose ruled line frames overlap, candidates whose likelihood of belonging to the item name of the overlapping integration candidate is less than the predetermined threshold T1 are discarded from the memory MEM (step S317). As an alternative rejection method, for example, a predetermined number of candidates may be kept in descending order of likelihood.
In step S318, 1 is added to the variable i (step S318), and the process proceeds to step S302. Note that the series of processes from step S302 to step S318 need not be performed on every ruled line frame in the table-format document. For example, if it becomes clear during processing that all subsequent ruled line frames are blank, or if the user decides that integration of ruled line frames is not necessary, the series of processes described above may be stopped.
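The overlap check of step S316 and the alternative rejection method of step S317 (keeping candidates in descending order of likelihood) can be sketched as follows. This is a minimal illustration, not the patented implementation: the tuple layout `(first_col, last_col, likelihood)` and the function names `has_overlap` and `keep_best` are assumptions made for this example, and the greedy selection is only one possible reading of "keeping a predetermined number of candidates".

```python
# Hypothetical candidate layout: (first_col, last_col, likelihood), columns 1-based.

def overlaps(a, b):
    """True if two candidates share at least one column."""
    return a[0] <= b[1] and b[0] <= a[1]

def has_overlap(candidates):
    """Step S316 (sketch): does any pair of candidates share a ruled line frame?"""
    return any(
        overlaps(a, b)
        for i, a in enumerate(candidates)
        for b in candidates[i + 1:]
    )

def keep_best(candidates):
    """Step S317 (assumed variant): visit candidates in descending order of
    likelihood and keep each one that does not overlap an already kept one."""
    kept = []
    for cand in sorted(candidates, key=lambda c: c[2], reverse=True):
        if not any(overlaps(cand, k) for k in kept):
            kept.append(cand)
    return sorted(kept)

# Columns 1-2 and columns 2-3 both claim column 2, as in the example in the text.
conflicting = [(1, 2, 0.9), (2, 3, 0.6)]
resolved = keep_best(conflicting)
```

Here `has_overlap(conflicting)` is true because both candidates contain column 2, and `keep_best` retains only the higher-likelihood candidate.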
FIG. 7 is a diagram explaining an example of the operation of the table recognition device of the first embodiment. FIG. 7(a) is an example of a table-format document to be recognized. In the table-format document shown in FIG. 7(a), "Total Fat" is written as an item value belonging to the item name "Item A", "Saturated Fat" as an item value belonging to the item name "Item B", and "25g" and "9g" as item values belonging to the item name "Item C". FIG. 7(b) is an example of the table structure recognition result for FIG. 7(a). FIG. 7(c) is an example of the character recognition result for FIG. 7(b). FIG. 7(d) is an example of the character string recognition result obtained by applying the ruled line frame integration determination to FIG. 7(c). For simplicity, the positions in the table of the item names "Item A", "Item B", and "Item C" are assumed to be known, and each item name is assumed to have been recognized correctly. In the following, the processing of the ruled line frame integration determination unit 3 for the item name row (the first row) of FIG. 7 is omitted, and the second row, which is the first row of item values, is treated as the new first row.
In the example of FIG. 7, the table structure recognition unit 1 first recognizes ruled line frames 501 to 509.
Next, the character recognition unit 2 recognizes the character strings within the respective areas of ruled line frames 501 to 509. Character strings 510 to 517 are obtained as the character recognition results.
Here, the item value "Total Fat" belonging to the item name "Item A" extends beyond the ruled line frame in the second row, first column, and protrudes into the adjacent ruled line frame in the second row, second column. Therefore, the character recognition result at this point is split into character string 513 ("Total F") and character string 514 ("at"). That is, character string 513 ("Total F") is recognized as the character string in ruled line frame 504 belonging to the item name "Item A", while character string 514 ("at") is erroneously recognized as the character string in ruled line frame 505 belonging to the item name "Item B".
Also, the item value "Saturated Fat" belonging to the item name "Item B" lies within ruled line frame 508, but its initial "S" touches the vertical ruled line. The vertical ruled line therefore causes a character recognition error, and the "S" is read as "6". As a result, the character recognition result at this point is the erroneous character string 516 ("6aturated Fat"). Furthermore, because the item value "9g" belonging to the item name "Item C" is written left-justified, it is also necessary to decide whether "6aturated Fat" and "9g" form one continuous character string.
Next, the ruled line frame integration determination unit 3 refers to the matching character strings (i.e., the character strings that should be written as item values) registered in the knowledge database 4, and determines which ruled line frames should be integrated based on the likelihood that the item value in each ruled line frame (i.e., the character string obtained by character recognition) belongs to the item name (i.e., the value obtained by subtracting the minimum standardized edit distance from 1). In other words, it determines whether the character string in each ruled line frame should be concatenated with the character strings in the adjacent ruled line frames to its left and right. Based on the determination result, it concatenates the character strings in the ruled line frames and outputs either the resulting concatenated character string or, where no concatenation occurs, the single character string, as the integrated character string, which is both the final integration determination result and the recognition result.
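The likelihood described above, 1 minus the minimum standardized edit distance over the registered matching character strings, can be sketched as follows. This is a minimal illustration under stated assumptions: the text does not fix the normalizer here, so dividing the Levenshtein distance by the length of the matching character string and clamping the result at 0 are assumptions, and the names `edit_distance` and `likelihood` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, and substitutions cost 1 each."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def likelihood(text: str, matching_strings: list[str]) -> float:
    """1 minus the smallest normalized edit distance, clamped to [0, 1].

    Assumption: the distance is divided by the length of the matching string.
    """
    best = 0.0
    for ref in matching_strings:
        if not ref:
            continue  # skip empty registrations to avoid division by zero
        ned = edit_distance(text, ref) / len(ref)
        best = max(best, max(0.0, 1.0 - ned))
    return best
```

Taking the maximum of `1 - NED` over all registered strings is equivalent to using the minimum standardized edit distance, so a recognized string that exactly matches any registered item value scores 1.0.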
The specific operation of the ruled line frame integration determination unit 3 will now be described with reference, as appropriate, to the process shown in the flowchart of FIG. 6. First, consider the case where, in the first row (i=1) of the item values of the table, ruled line frame 504 (i.e., character string 513 ("Total F")), ruled line frame 505 (i.e., character string 514 ("at")), and ruled line frame 506 (i.e., character string 515 ("25g")) are evaluated against the matching character strings registered in the knowledge database 4.
For simplicity, only the following entries of the knowledge database 4 are considered: the item value "Total Fat" belonging to the item name "Item A", the item value "Trans Fat" belonging to the item name "Item B", and "[*NUM*]g" belonging to the item name "Item C".
First, when variable i=1 and variable j=1, it is determined whether j is equal to or less than the number of items (step S304). Since the number of items in the second row is 3, j is equal to or less than the number of items (Yes in step S304), and it is determined whether a character recognition result is present in the ruled line frame of the j-th column (step S305). When variable i=1 and variable j=1, character string 513 is present in the ruled line frame of the j-th column (i.e., ruled line frame 504) (Yes in step S305), so 0 is assigned to variable k (step S306). In step S307, since the number of items in the second row is 3, the value obtained by subtracting (j+1) from the number of items is greater than k (=0) (Yes in step S307), so the process proceeds to step S308.
Next, it is determined whether a character recognition result is present in the ruled line frame of the (j+k)-th column (step S308). Since character string 513 exists in the ruled line frame of the (j+k)-th column (k=0, i.e., ruled line frame 504) (Yes in step S308), variable L1 is calculated (step S309), and variable L2 is calculated (step S310). Here, variable L1 is the likelihood between character string 513 "Total F" in ruled line frame 504 and the item value "Total Fat" belonging to the item name "Item A" in the knowledge database 4, and can be calculated from the standardized edit distance NED between character string 513 "Total F" and the matching character string "Total Fat". Variable L2 is the likelihood between the concatenated character string "Total Fat", obtained by concatenating ruled line frame 504 and ruled line frame 505, and the item value "Total Fat" belonging to the item name "Item A" in the knowledge database 4, and can be calculated from the standardized edit distance NED between the concatenated character string "Total Fat" and the matching character string "Total Fat". Next, variable L1 and variable L2 are compared (step S311).
Here, in the calculation of the standardized edit distance NED, converting the character string "Total F" to the item value "Total Fat" requires two character edits (the insertion of "a" and "t"). The length of the character string "Total Fat" is 9, including the space. Therefore, variable L1 is 1-(2/9)=0.778. On the other hand, converting the concatenated character string "Total Fat" to the item value "Total Fat" requires no edits (i.e., 0 characters). Therefore, variable L2 is 1-(0/9)=1.0. Comparing variables L1 and L2 gives L1<L2 (Yes in step S311), so 1 is added to variable k (step S312). The process then returns to step S307.
In step S307, since the number of items in the second row is 3, the value obtained by subtracting (j+1) from the number of items is equal to k (=1) (Yes in step S307), and the process proceeds to step S308.
As above, it is determined whether a character recognition result is present in the ruled line frame of the (j+k)-th column (step S308). Since character string 514 "at" is present in the ruled line frame of the (j+k)-th column (i.e., ruled line frame 505) (Yes in step S308), variable L1 is calculated (step S309), and variable L2 is calculated (step S310). Here, variable L1 is the likelihood between the concatenated character string "Total Fat", obtained by concatenating ruled line frame 504 and ruled line frame 505, and the item value "Total Fat" belonging to the item name "Item A" in the knowledge database 4, and can be calculated from the standardized edit distance NED. Variable L2 is the likelihood between the concatenated character string "Total Fat 25g", obtained by concatenating ruled line frames 504, 505, and 506, and the item value "Total Fat" belonging to the item name "Item A" in the knowledge database 4, and can be calculated from the standardized edit distance NED. Next, variable L1 and variable L2 are compared (step S311).
Here, in the calculation of the standardized edit distance NED, converting the concatenated character string "Total Fat" to the item value "Total Fat" requires no edits (i.e., 0 characters). Therefore, variable L1 is 1-(0/9)=1.0. On the other hand, converting the concatenated character string "Total Fat 25g" to the item value "Total Fat" requires 4 character edits. Therefore, variable L2 is 1-(4/9)=0.556. Comparing variables L1 and L2 gives L1>L2 (No in step S311), and the process proceeds to step S313.
In step S313, the variable L1 calculated immediately before is greater than the predetermined threshold T1=0.5 (Yes in step S313), so ruled line frame 504 and ruled line frame 505 become a ruled line frame integration candidate C[j] (step S314). Then, 1 is added to variable j (step S315), and the process returns to the beginning of step S304.
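The extension loop walked through above (steps S307 to S314 for one starting column j) can be sketched as follows. This is a simplified illustration, not the flowchart of FIG. 6 itself: the empty-frame check of step S308 is omitted, cell strings are joined without a separator, the edit distance is divided by the length of the matching string, and the names `extend_from` and `likelihood` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def likelihood(text, refs):
    """1 - min normalized edit distance, clamped at 0 (normalizer is assumed)."""
    return max((max(0.0, 1.0 - edit_distance(text, r) / len(r)) for r in refs),
               default=0.0)

def extend_from(cells, j, refs, t1=0.5):
    """Grow a candidate rightwards from 1-based column j while that raises the
    likelihood. Returns (first_col, last_col, likelihood), or None if below t1."""
    n, k = len(cells), 0
    while k <= n - (j + 1):                                   # S307
        l1 = likelihood("".join(cells[j - 1:j + k]), refs)      # S309
        l2 = likelihood("".join(cells[j - 1:j + k + 1]), refs)  # S310
        if l1 < l2:                                           # S311
            k += 1                                            # S312
        else:
            break
    l1 = likelihood("".join(cells[j - 1:j + k]), refs)
    return (j, j + k, l1) if l1 >= t1 else None               # S313, S314

row = ["Total F", "at", "25g"]
candidate_a = extend_from(row, 1, ["Total Fat"])   # columns 1-2 merge
candidate_b = extend_from(row, 2, ["Trans Fat"])   # rejected by the threshold
```

Starting from column 1, the loop accepts the second frame because the concatenation matches "Total Fat" exactly, and stops before the third frame because adding "25g" lowers the likelihood.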
Next, when variable i=1 and variable j=2, it is determined whether j is equal to or less than the number of items (step S304). Since the number of items in the second row is 3, j is equal to or less than the number of items (Yes in step S304), and it is determined whether a character recognition result is present in the ruled line frame of the j-th column (step S305). When variable i=1 and variable j=2, character string 514 is present in the ruled line frame of the j-th column (i.e., ruled line frame 505) (Yes in step S305), so 0 is assigned to variable k (step S306). In step S307, since the number of items in the second row is 3, the value obtained by subtracting (j+1) from the number of items is equal to k (=0) (Yes in step S307), so the process proceeds to step S308.
Next, it is determined whether a character recognition result is present in the ruled line frame of the (j+k)-th column (step S308). Since character string 514 exists in the ruled line frame of the (j+k)-th column (k=0, i.e., ruled line frame 505) (Yes in step S308), variable L1 is calculated (step S309), and variable L2 is calculated (step S310). Here, variable L1 is the likelihood between character string 514 "at" in ruled line frame 505 and the item value "Trans Fat" belonging to the item name "Item B" in the knowledge database 4, and can be calculated from the standardized edit distance NED. Variable L2 is the likelihood between the concatenated character string "at 25g", obtained by concatenating the character string in ruled line frame 505 and the character string in ruled line frame 506, and the item value "Trans Fat" belonging to the item name "Item B" in the knowledge database 4, and can be calculated from the standardized edit distance NED. Next, variables L1 and L2 are compared (step S311).
Here, in the calculation of the standardized edit distance NED, converting the character string "at" to the item value "Trans Fat" requires 7 character edits. The length of the character string "Trans Fat" is 9, including the space. Therefore, variable L1 is 1-(7/9)=0.222. On the other hand, converting the concatenated character string "at 25g" to the item value "Trans Fat" requires 11 character edits. Therefore, variable L2 is 1-(11/9), which is limited to 0.0 (values of 0 or less are limited to 0). Comparing variables L1 and L2 gives L1>L2 (No in step S311), and the process proceeds to step S313.
In step S313, the variable L1 calculated immediately before is smaller than the predetermined threshold T1=0.5 (No in step S313), so no ruled line frame integration candidate C[j] is produced. The process proceeds to step S315, 1 is added to variable j (step S315), and the process returns to the beginning of step S304.
Finally, when variable i=1 and variable j=3, it is determined whether j is equal to or less than the number of items (step S304). Since the number of items in the second row is 3, j is equal to or less than the number of items (Yes in step S304), and it is determined whether a character recognition result is present in the ruled line frame of the j-th column (step S305). When variable i=1 and variable j=3, character string 515 is present in the ruled line frame of the j-th column (i.e., ruled line frame 506) (Yes in step S305), so 0 is assigned to variable k (step S306). In step S307, since the number of items in the second row is 3, the value obtained by subtracting (j+1) from the number of items is less than k (=0) (No in step S307), so the process proceeds to step S313.
In step S313, variable L1 is calculated. Here, variable L1 is the likelihood between character string 515 "25g" in ruled line frame 506 and the item value "[*NUM*]g" belonging to the item name "Item C" in the knowledge database 4, and can be calculated from the standardized edit distance NED.
Here, in the calculation of the standardized edit distance NED, converting the character string "25g" to the item value "[*NUM*]g" requires no edits, because [*NUM*] is a wildcard that accepts any numeric value. Therefore, variable L1 is 1.0, which is greater than the predetermined threshold T1=0.5 (Yes in step S313), and the single, unintegrated ruled line frame 506 becomes a ruled line frame integration candidate C[j] (step S314).
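One way to treat the `[*NUM*]` wildcard is to turn the registered item value into a regular expression and test for a full match. The sketch below is an assumption-laden illustration: the text only states that `[*NUM*]` accepts any numeric value, so the accepted number syntax (digits with an optional decimal part) and the function names `template_to_regex` and `matches_template` are choices made for this example. How a partial mismatch against a wildcard template would be scored is not stated in the text and is not modelled here.

```python
import re

NUM_WILDCARD = "[*NUM*]"
# Assumed numeric syntax: one or more digits, optionally followed by a decimal part.
NUM_REGEX = r"\d+(?:\.\d+)?"

def template_to_regex(template: str) -> re.Pattern:
    """Escape the literal parts of the template and splice in the number pattern."""
    literal_parts = [re.escape(part) for part in template.split(NUM_WILDCARD)]
    return re.compile(NUM_REGEX.join(literal_parts))

def matches_template(text: str, template: str) -> bool:
    """True when the whole text fits the template, i.e. no edits are needed."""
    return template_to_regex(template).fullmatch(text) is not None

full_match = matches_template("25g", "[*NUM*]g")
```

A full match corresponds to the case in the text where no edits are needed, giving a likelihood of 1.0.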
As described above, all ruled line frames belonging to the first row (variable i=1) of the item values of the table have been evaluated, and the pair of ruled line frames 504 and 505 and the single ruled line frame 506 have been obtained as ruled line frame integration candidates C[j]. Then, in step S316, it is determined whether the ruled line frames to be integrated overlap. Since there is no overlap (No in step S316), the character strings that were recognized separately as character string 513 "Total F" and character string 514 "at" are concatenated into the single concatenated character string "Total Fat".
Next, in step S316, the likelihood is used to determine whether the concatenated character string or the single character string contains a character recognition error. Since the likelihood of the concatenated character string formed from character string 513 "Total F" and character string 514 "at" is 1.0 (i.e., an exact match with the character string that should be written as the item value), the concatenated character string is judged to have been recognized correctly, and character string 521 ("Total Fat") is output as the integrated character string, which is both the final integration determination result and the recognition result. Character string 515 is handled as a single character string that is not concatenated. Since its likelihood is also 1.0, character string 515 is judged to have been recognized correctly, as with character strings 513 and 514, and character string 522 ("25g") is output as the integrated character string, which is both the final integration determination result and the recognition result.
Next, consider the case where, in the second row (i=2) of the item values of the table, ruled line frame 508 (i.e., character string 516 ("6aturated Fat")) and ruled line frame 509 (i.e., character string 517 ("9g")) are evaluated against the matching character strings registered in the knowledge database 4.
For simplicity, only the following entries of the knowledge database 4 are considered: the item value "Saturated Fat" belonging to the item name "Item B" and "[*NUM*]g" belonging to the item name "Item C".
First, when variable i=2 and variable j=1, it is determined whether j is equal to or less than the number of items (step S304). Since the number of items in the third row is 3, j is equal to or less than the number of items (Yes in step S304), and it is determined whether a character recognition result is present in the ruled line frame in the jth column (step S305). When variable i=2 and variable j=1, no character string is present in the ruled line frame in the jth column (i.e., ruled line frame 507) (No in step S305), so 1 is added to variable j (step S315) and the process returns to the beginning of step S304.
Next, when variable i=2 and variable j=2, it is determined whether j is equal to or less than the number of items (step S304). Since the number of items in the third row is 3, j is equal to or less than the number of items (Yes in step S304), and it is determined whether a character recognition result is present in the ruled line frame in the jth column (step S305). When variable i=2 and variable j=2, character string 516 is present in the ruled line frame in the jth column (i.e., ruled line frame 508) (Yes in step S305), so 0 is substituted for variable k (step S306). In step S307, since the number of items in the third row is 3, the value obtained by subtracting (j+1) from the number of items is equal to k (=0) (Yes in step S307), so the process proceeds to step S308.
Next, it is determined whether a character recognition result is present in the ruled line frame in the (j+k)th column (step S308). Since the character string 516 is present in the ruled line frame in the (j+k)th column (k=0, i.e., the ruled line frame 508) (Yes in step S308), a variable L1 is calculated (step S309), and a variable L2 is calculated (step S310). Here, the variable L1 is the likelihood between the character string 516 "6aturated Fat" in the ruled line frame 508 and the item value "Saturated Fat" belonging to the item name "Item B" in the knowledge database 4, and can be calculated from the standardized edit distance NED. The variable L2 is the likelihood between "6aturated Fat9g", which is the concatenated character string obtained by concatenating the character string in ruled line frame 508 and the character string in ruled line frame 509, and the item value "Saturated Fat" belonging to the item name "Item B" in the knowledge database 4, and can likewise be calculated from the standardized edit distance NED. Next, the variables L1 and L2 are compared (step S311).
Here, in the calculation of the standardized edit distance NED, converting the character string "6aturated Fat" to the item value "Saturated Fat" requires one character to be replaced. The length of the character string "Saturated Fat" is 13, including the blank character. Therefore, the variable L1 is 1-(1/13)=0.923. On the other hand, converting the concatenated character string "6aturated Fat9g" to the item value "Saturated Fat" requires three edits (one substitution and two deletions). The length of the character string "6aturated Fat9g" is 15, including the blank character. Therefore, the variable L2 is 1-3/15=0.8. Comparing the variables L1 and L2 gives L1>L2 (No in step S311), and the process proceeds to step S313.
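The arithmetic above can be reproduced with a short sketch. It assumes that the likelihood is one minus the edit distance divided by the length of the longer of the two strings; the description does not state the normalization explicitly, but this choice agrees with both worked values (13 and 15). The function names `levenshtein` and `likelihood` are illustrative and do not appear in the description.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def likelihood(recognized: str, registered: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(recognized), len(registered))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(recognized, registered) / longest


l1 = likelihood("6aturated Fat", "Saturated Fat")    # single frame 508
l2 = likelihood("6aturated Fat9g", "Saturated Fat")  # frames 508 and 509 joined
print(round(l1, 3), round(l2, 3))
```

With these inputs the single-frame string scores higher than the concatenated one, which corresponds to the "No" branch of step S311.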
In step S313, the previously calculated variable L1 is greater than a predetermined threshold T1 = 0.5 (Yes in step S313), so the unmerged single ruled line frame 508 becomes the ruled line frame merger candidate C[j] (step S314).
Then, 1 is added to the variable j (step S315), and the process returns to the beginning of step S304.
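The decision made in steps S309 to S314 for one starting column can be sketched as follows. This is a minimal illustration under stated assumptions, not the flowchart itself: it assumes that the "Yes" branch of step S311 extends the merge by one more frame whenever the longer string scores at least as well, that ties extend the merge, and that an empty frame stops the extension. The name `choose_merge`, the 0-based indexing, and the likelihood normalization are all hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def likelihood(recognized: str, registered: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(recognized), len(registered))
    return 1.0 if longest == 0 else 1.0 - levenshtein(recognized, registered) / longest


def choose_merge(cells, j, registered, t1=0.5):
    """Return (first, last) 0-based frame indices to merge, or None if below T1.

    cells: recognized strings of one row, one entry per ruled line frame.
    """
    k = 0
    l1 = likelihood(cells[j], registered)                 # frames j .. j+k
    while j + k + 1 < len(cells) and cells[j + k + 1]:
        l2 = likelihood("".join(cells[j:j + k + 2]), registered)  # one frame more
        if l1 > l2:                                       # "No" in step S311
            break
        k += 1                                            # assumed "Yes" branch
        l1 = l2
    return (j, j + k) if l1 > t1 else None                # threshold T1 check


print(choose_merge(["", "6aturated Fat", "9g"], 1, "Saturated Fat"))
print(choose_merge(["Total F", "at", "25g"], 0, "Total Fat"))
```

The first call keeps frame 508 on its own, and the second joins "Total F" and "at", matching the two walkthroughs in the text.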
Finally, when variable i=2 and variable j=3, it is determined whether j is equal to or less than the number of items (step S304). Since the number of items in the third row is 3, j is equal to or less than the number of items (Yes in step S304), and it is determined whether a character recognition result is present in the ruled line frame in the jth column (step S305). When variable i=2 and variable j=3, character string 517 is present in the ruled line frame in the jth column (i.e., ruled line frame 509) (Yes in step S305), so 0 is substituted for variable k (step S306). In step S307, since the number of items in the third row is 3, the value obtained by subtracting (j+1) from the number of items is less than k (=0) (No in step S307), so the process proceeds to step S313.
In step S313, a variable L1 is calculated. Here, the variable L1 is the likelihood between the character string 517 "9g" in the ruled line frame 509 and the item value "[*NUM*]g" belonging to the item name "Item C" in the knowledge database 4, and can be calculated from the standardized edit distance NED.
Here, when the character string "9g" is converted to the item value "[*NUM*]g" in the calculation of the standardized edit distance NED, no replacement is needed, because [*NUM*] is a wildcard that accepts any numerical value. Therefore, the variable L1 is 1.0, which is greater than the predetermined threshold T1=0.5 (Yes in step S313), so the unintegrated single ruled line frame 509 becomes the ruled line frame integration candidate C[j] (step S314).
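One way to realize the wildcard behavior is sketched below. It is only an assumption about how [*NUM*] might be handled: numeric runs in the recognized string and the [*NUM*] token in the registered pattern are both replaced by the same single placeholder character before the edit distance is taken, so that any number matches the wildcard at zero cost. The helper names, the placeholder character, and the accepted number format (digits with an optional decimal part) are hypothetical.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def likelihood(recognized: str, registered: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(recognized), len(registered))
    return 1.0 if longest == 0 else 1.0 - levenshtein(recognized, registered) / longest


NUM_TOKEN = "[*NUM*]"
PLACEHOLDER = "\x00"                      # hypothetical stand-in for "some number"
NUMBER = re.compile(r"\d+(?:\.\d+)?")     # assumed number format


def wildcard_likelihood(recognized: str, pattern: str) -> float:
    """Compare after mapping numbers and the wildcard to one shared placeholder."""
    normalized_text = NUMBER.sub(PLACEHOLDER, recognized)
    normalized_pattern = pattern.replace(NUM_TOKEN, PLACEHOLDER)
    return likelihood(normalized_text, normalized_pattern)


print(wildcard_likelihood("9g", "[*NUM*]g"))
print(wildcard_likelihood("25g", "[*NUM*]g"))
```

Both "9g" and "25g" match the pattern exactly under this scheme, while a misread unit such as "9q" would lose likelihood only for the unit character.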
As described above, all ruled line frames belonging to the second row (variable i = 2) of the table item values have been evaluated, and single ruled line frames 508 and 509 have been obtained as candidates for merging ruled line frames C[j]. Then, in step S316, it is determined whether there is any overlap in the ruled line frames to be merged. Since there is no overlap in the ruled line frames to be merged (No in step S316), character string 516 in ruled line frame 508 and character string 517 in ruled line frame 509 are each treated as a single, unlinked character string.
Next, in step S313, the likelihood is used to determine whether the concatenated character string or the single character string contains a character recognition error. The likelihood of the character string 516 is 0.923, that is, the string is not an exact match with the character string that should appear as the item value, so the likelihood is compared with a predetermined threshold T2 for error determination. A suitable value for T2 is, for example, 0.7, which can be set in advance. Since the likelihood of the character string 516 (0.923) is equal to or greater than T2 (0.7), it is presumed that this character string contains an error. Therefore, in place of the character string 516, the character string 523 is set to "Saturated Fat", the string with the highest likelihood among the matching character strings registered in the knowledge database 4. The character string 523 is then output as the integrated character string, which is both the final integration judgment result and the recognition result.
On the other hand, since the likelihood of character string 517 is 1.0, character string 517 is determined to have been correctly recognized, and character string 524 ("9g") is output as the integrated character string, which is both the final integration judgment result and the recognition result.
Even if the likelihood of a concatenated string or a single string is 1.0 (i.e., an exact match), the string may still be replaced with the string that has the highest likelihood among the matching strings registered in the knowledge database 4, because the replacement yields the same string. That is, whenever the likelihood is equal to or greater than the predetermined threshold T2 for error determination, including the case where it is 1.0, the string may be replaced with the highest-likelihood matching string registered in the knowledge database 4. In other words, if the likelihood is equal to or greater than T2, the string may be replaced with the matching string registered in the knowledge database 4 that was used to calculate that likelihood.
If the likelihood of the concatenated string is less than a predetermined threshold T2 (e.g., 0.7) for determining an error, it may be that, for example, the string itself has been correctly recognized, but the degree of match with the matching string registered in the knowledge database 4 is low. In that case, the concatenated string or the single string may be output as is, without being replaced with the matching string registered in the knowledge database 4.
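The replacement rule built on the threshold T2 can be sketched as follows, assuming the same likelihood normalization as in the worked values and a plain list of registered strings. The function name `correct_or_keep` and the sample strings other than "Saturated Fat" and "Trans Fat" are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def likelihood(recognized: str, registered: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(recognized), len(registered))
    return 1.0 if longest == 0 else 1.0 - levenshtein(recognized, registered) / longest


def correct_or_keep(recognized, registered_strings, t2=0.7):
    """Return (output string, best likelihood).

    At or above T2 the best registered string replaces the recognized one;
    below T2 the recognized string is output unchanged.
    """
    if not registered_strings:
        return recognized, 0.0
    best = max(registered_strings, key=lambda s: likelihood(recognized, s))
    score = likelihood(recognized, best)
    return (best if score >= t2 else recognized), score


item_b = ["Saturated Fat", "Trans Fat"]
print(correct_or_keep("6aturated Fat", item_b))  # replaced by the registered string
print(correct_or_keep("Sodium", item_b))         # poor match, output as recognized
```

An exact match (likelihood 1.0) also passes through the replacement branch, which leaves the string unchanged, as noted in the text.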
The above processing is performed for all ruled line frames, and strings 518 to 524 are obtained as the integrated strings resulting from the final integration judgment of the ruled line frames.
In this embodiment 1, a specific example of the operation of the table recognition device has been described only for the ruled frames belonging to the item values of a tabular document, but this is not limiting. For example, it is possible to recognize item names in the same way as item values. In this case, for example, matching character strings for item names can be registered in the knowledge database 4, and the same processing can be performed on the ruled frames belonging to item names as on the ruled frames belonging to item values.
As described above, the ruled line frame integration determination unit 3 does not use the ruled line frame information of adjacent ruled line frames when judging whether to integrate ruled line frames. Therefore, ruled line frames can be integrated correctly without relying on the ruled line frame information.
The ruled line frame integration determination unit 3 also refers to the matching strings registered in the knowledge database 4 and determines that strings should be concatenated when the concatenated string is not a meaningless sequence of characters but is likely to be meaningful (i.e., when its likelihood is high and it is judged to be close to an item name or item value). Therefore, ruled line frames can be integrated even when part of the item name or item value is wrong (for example, because of a character recognition error, a typo, or partial omission of the written content). Furthermore, because the string can be replaced with a string close to a matching string registered in the knowledge database 4, the correct string can be output.
The table recognition device described in detail above in Embodiment 1 calculates a likelihood as the degree to which the character recognition result within each ruled line frame matches an item, and determines which character strings should be concatenated based on the calculated likelihood.
Therefore, a character string written across a plurality of ruled frames can be accurately recognized without relying on ruled frame information.
The table recognition device described in detail in Embodiment 1 also refers to the matching strings registered in the knowledge database and determines that strings should be concatenated when the concatenated string is highly likely to be meaningful. Furthermore, the concatenated string is replaced with a string close to a matching string registered in the knowledge database.
Therefore, even if the item name or item value contains a character recognition error, not only can character strings be concatenated accurately, but errors in the character recognition results can also be corrected at the same time, which provides a synergistic effect.
Embodiment 2.
In Embodiment 1 described above, a knowledge database is used for the ruled line frame integration judgment, but the configuration is not limited to this. For example, the integration judgment can also use table structure constraint information, that is, information that restricts which matching character strings can be written as an integrated character string. This configuration is described as Embodiment 2.
FIG. 8 is a functional configuration diagram showing the configuration of the table recognition device 100 in the second embodiment. The new component compared to FIG. 1 is the table structure knowledge database 5. The other components and operations are the same as those in FIG. 1, and the description will be omitted.
The table structure knowledge database 5 stores table structure constraint information, which restricts the matching strings that can be written in a ruled line frame as an integrated string. For example, the table structure constraint information restricts, based on the string information in the surrounding ruled line frames, which of the multiple matching strings defined in the knowledge database 4 can be written in a ruled line frame as an integrated string. More specifically, when the table items represent a classification such as major items, medium items, and minor items, the table structure constraint information indicates the relationships between those items. The table structure knowledge database 5 may register matching strings that can be written as item values belonging to item names in the same manner as the knowledge database 4. FIG. 9 shows an example of the table structure knowledge database 5. As the table structure constraint information shown in FIG. 9, the left column registers matching strings belonging to the item name "Item A" (referred to as constraint strings). The right column registers the matching strings that can be written under the adjacent item name "Item B" (referred to as writable strings) when the corresponding matching string (i.e., constraint string) is written for the item name "Item A". Note that a matching string in the table structure knowledge database 5 is not limited to a single word and may be multiple words, a phrase, or a sentence.
The ruled line frame integration determination unit 3 refers to the knowledge database 4, the table structure knowledge database 5, and the ruled line frame integration candidates C[j] stored in a memory MEM (not shown), and uses the table structure constraint information to narrow the multiple matching character strings down to one or more matching character strings. It then calculates the degree of match between those one or more matching character strings and the character string recognized by the character recognition unit 2, and determines which ruled line frames should be integrated according to the degree of match.
In Embodiment 2, for example, the ruled line frame integration determination unit 3 refers to the table structure knowledge database 5, and when the ruled line frame integration candidate C[j] corresponds to an item value (constraint string) belonging to a given item name, it restricts the candidate item values (i.e., integrated strings) in the knowledge database 4 that belong to another item name adjacent to the given item name to the writable strings, thereby narrowing them down to one or more matching strings.
FIG. 10 is a flowchart showing the operation sequence of the ruled line frame integration determination unit 3 in Embodiment 2. In FIG. 10, the steps that differ from FIG. 6 are steps S309A and S310A. Steps with the same numbers as in FIG. 6 perform the same processing as in Embodiment 1, and their description is omitted.
In step S309A, the ruled line frames from the jth column to the (j+k)th column are merged, and the character strings in those frames are concatenated to obtain a concatenated character string. Then, the table structure knowledge database 5 and the ruled line frame integration candidate C[j-1] stored in the memory MEM are referenced to determine whether C[j-1], the candidate adjacent to the obtained concatenated character string, is a constraint string for the obtained concatenated character string (step S309A).
If the ruled line frame integration candidate C[j-1] is a constraint string, the matching strings belonging to the jth item name (i.e., the jth column) in the knowledge database 4 are restricted to the writable strings listed in the table structure knowledge database 5. Then, the likelihood L[j+k,j] that the concatenated string belongs to the item name in the jth column is calculated and assigned to the variable L1. If C[j-1] is not a constraint string, no restriction based on the table structure knowledge database 5 is applied, and the likelihood L[j+k,j] that the concatenated string belongs to the item name in the jth column is calculated and assigned to the variable L1 (step S309A).
In step S310A, the ruled line frames from the jth column to the (j+k+1)th column are merged, and the character strings in those frames are concatenated to obtain a concatenated character string. Then, as in step S309A, the table structure knowledge database 5 and the ruled line frame integration candidate C[j-1] stored in the memory MEM are referenced to determine whether C[j-1], the candidate adjacent to the obtained concatenated character string, is a constraint string for the obtained concatenated character string (step S310A).
If the ruled frame merging candidate C[j-1] is a constraint string, the matching strings belonging to the jth item name (i.e., the jth column) in the knowledge database 4 are restricted to the describable strings listed in the table structure knowledge database 5. The likelihood L[j+k+1,j] that the concatenated string belongs to the jth item name is then calculated and assigned to the variable L2. If the ruled frame merging candidate C[j-1] is not a constraint string, no restriction based on the table structure knowledge database 5 is applied; the likelihood L[j+k+1,j] is calculated in the same way and assigned to the variable L2 (step S310A).
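Steps S309A and S310A can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the disclosure: the database contents, the function names, the separate `first`/`last`/`item` parameters, and the scoring rule (1 minus the edit distance divided by the length of the matching string) are all assumptions made for the example.

```python
# Hypothetical contents; the real databases 4 and 5 are not reproduced here.
KNOWLEDGE_DB = {  # stands in for knowledge database 4: item name -> matching strings
    "Item A": ["Total Fat", "Cholesterol"],
    "Item B": ["Saturated Fat", "Trans Fat", "Dietary Fiber"],
}
STRUCTURE_DB = {  # stands in for table structure knowledge database 5
    "Total Fat": ["Saturated Fat", "Trans Fat"],
}


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def likelihood(text: str, candidates: list[str]) -> float:
    """Best score over the candidates, floored at 0 (assumed scoring rule)."""
    scores = (max(0.0, 1 - edit_distance(text, c) / len(c)) for c in candidates)
    return max(scores, default=0.0)


def constrained_likelihood(cells, first, last, item, prev_candidate):
    """Merge frames first..last (inclusive) and score them against `item`."""
    merged = "".join(cells[first:last + 1])
    candidates = KNOWLEDGE_DB[item]
    allowed = STRUCTURE_DB.get(prev_candidate)
    if allowed is not None:  # C[j-1] is a constraint string
        candidates = [c for c in candidates if c in allowed]
    return likelihood(merged, candidates)


cells = ["Total Fat", "Satura", "ted Fat"]
L1 = constrained_likelihood(cells, 1, 1, "Item B", "Total Fat")  # frame 1 alone
L2 = constrained_likelihood(cells, 1, 2, "Item B", "Total Fat")  # frames 1 and 2 merged
print(L1, L2)
```

In this toy row, merging the two frames yields "Saturated Fat", so L2 exceeds L1; with "Total Fat" as the adjacent candidate, the hypothetical entry "Dietary Fiber" is never scored.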
A specific example of the operation of the table recognition device of Embodiment 2 is described below, applying the table structure knowledge database 5 shown in FIG. 9 to the table shown in FIG. 7.
The ruled line frame integration determination unit 3 refers to the table structure knowledge database 5 and imposes constraints on the knowledge database 4, which is used to calculate the likelihood that the character recognition result within each ruled frame belongs to an item.
Specifically, when character string 521 ("Total Fat") is obtained as an item value belonging to the item name "Item A" in FIG. 7(d), the table structure knowledge database 5 in FIG. 9 is consulted, and the candidate describable strings for the item value belonging to the adjacent item name "Item B" are restricted to "Saturated Fat" or "Trans Fat."
Restricting the integrated strings that are candidates for item values under the adjacent item name to the describable matching strings narrows the merging candidates to more plausible strings, which improves the accuracy of the ruled frame integration determination. It also reduces the number of merging candidates, and therefore the amount of processing required for the likelihood calculation.
As described above, the table recognition device of Embodiment 2 uses table structure constraint information, that is, information restricting which matching strings can be written as an integrated string, to limit the ruled frame integration determination to one or more matching strings, thereby improving the accuracy of that determination.
Embodiment 3.
The ruled frame integration determination can also take into account the possibility that characters close to a ruled line have been misrecognized. This configuration is described as Embodiment 3.
When calculating the likelihood that a recognized character string belongs to a given item, the ruled line frame integration determination unit 3 weights each character individually so as to reduce the influence of misrecognized characters close to ruled lines.
For example, when the standardized (normalized) edit distance is used as the likelihood, the cost of character conversions such as insertion, deletion, and replacement can be given a smaller weight for characters close to a ruled line. Relative to the usual cost of 1, a weight of 0.5, for example, is suitable, but the value is not limited to this and can be changed as appropriate, for example according to the type of ruled line.
Weighting the conversion cost of characters close to ruled lines downward suppresses the effect of their misrecognition; in other words, misrecognition of such characters is tolerated.
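The weighting described above can be sketched as a weighted edit distance. This is a minimal sketch under assumptions, not the disclosed implementation: the function names are hypothetical, the weight is attached to characters of the recognized string, inserted characters are charged the full cost of 1, and the result is normalized by the length of the matching string.

```python
def weighted_edit_distance(recognized: str, target: str,
                           near_line: frozenset = frozenset(),
                           weight: float = 0.5) -> float:
    """Edit distance in which recognized characters near a ruled line cost less.

    near_line holds the indices (into `recognized`) of characters judged close
    to a ruled line. Insertions always cost 1.0 (an assumption of this sketch).
    """
    cost = [weight if i in near_line else 1.0 for i in range(len(recognized))]
    prev = [float(j) for j in range(len(target) + 1)]  # insert j target characters
    for i, ch in enumerate(recognized):
        cur = [prev[0] + cost[i]]  # delete every recognized character so far
        for j, tc in enumerate(target, 1):
            substitute = prev[j - 1] + (0.0 if ch == tc else cost[i])
            delete = prev[j] + cost[i]
            insert = cur[j - 1] + 1.0
            cur.append(min(substitute, delete, insert))
        prev = cur
    return prev[-1]


def weighted_likelihood(recognized: str, target: str,
                        near_line: frozenset = frozenset(),
                        weight: float = 0.5) -> float:
    """1 - distance / len(target), floored at 0 (assumed normalization)."""
    distance = weighted_edit_distance(recognized, target, near_line, weight)
    return max(0.0, 1.0 - distance / len(target))


# "ABG" recognized where "ABC" was written; the last character touches a line.
print(weighted_likelihood("ABG", "ABC"))                  # unweighted
print(weighted_likelihood("ABG", "ABC", frozenset({2})))  # weighted
```

With the weight applied, the single suspect character lowers the score by half as much as it would otherwise.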
Whether a character is close to a ruled line can be determined as follows. If the character touches the ruled line, it is determined to be close to the ruled line. Even if the character does not touch the ruled line, the determination may be based on the distance from the ruled line to the character: if that distance is smaller than a predetermined threshold, the character is determined to be close to the ruled line. The threshold can be set in advance to a value that depends on, for example, the thickness of the ruled line or the size of the character. Specifically, the threshold can be set to three times the thickness of the ruled line. Furthermore, the number of characters determined to be close to a ruled line within the character string of one ruled frame is not limited to one. For example, in the three-character string "ABC", if the distances from the ruled line to "B" and to "C" are both smaller than the threshold, both "B" and "C" are determined to be close to the ruled line, and both can be subject to the cost weighting.
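The proximity test can be sketched as follows for a vertical ruled line. This is a hedged sketch: the `CharBox` type, the coordinate convention (horizontal pixel positions), and the function name are assumptions for illustration; only the rule of "touching, or closer than three times the line thickness" comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class CharBox:
    """Horizontal extent of one recognized character (hypothetical type)."""
    char: str
    left: float
    right: float


def is_near_vertical_line(box: CharBox, line_x: float,
                          line_thickness: float, factor: float = 3.0) -> bool:
    """True if the character touches the line or lies within the threshold."""
    line_left = line_x - line_thickness / 2
    line_right = line_x + line_thickness / 2
    if box.right >= line_left and box.left <= line_right:
        return True  # the character touches or overlaps the ruled line
    if box.right < line_left:
        gap = line_left - box.right   # character lies to the left of the line
    else:
        gap = box.left - line_right   # character lies to the right of the line
    return gap < factor * line_thickness


# "ABC" written just left of a vertical line centred at x = 30, 4 px thick.
boxes = [CharBox("A", 0, 8), CharBox("B", 10, 18), CharBox("C", 20, 28)]
near = {i for i, b in enumerate(boxes) if is_near_vertical_line(b, 30, 4)}
print(near)
```

In this layout the threshold is 12 px, so "B" (gap 10 px) and "C" (touching) are flagged while "A" (gap 20 px) is not, mirroring the "ABC" example in the text.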
FIG. 11 shows a specific example of the operation of the table recognition device of Embodiment 3. FIG. 11(a) is an example of a table to be recognized; it has "Total Fat" as the item value belonging to the item name "Item A" and "25g" as the item value belonging to the item name "Item C". FIG. 11(b) is an example of the table structure recognition result for FIG. 11(a). FIG. 11(c) is an example of the character recognition result for FIG. 11(b). FIG. 11(d) is an example of the character string recognition result obtained by applying the ruled frame integration determination to FIG. 11(c).
In the example of FIG. 11, the table structure recognition unit 1 first recognizes ruled frames 601 to 606.
Next, the character recognition unit 2 recognizes character strings 607 to 612 within the respective ruled frames. As shown in FIG. 11(c), in character string 610 the character "F", which is close to the vertical double ruled line, is misrecognized as "P". Likewise, in character string 611 the character "a", which is close to the vertical double ruled line, is misrecognized as "p".
Next, the specific operation of the ruled line frame integration determination unit 3 is described.
Consider evaluating the concatenated string "Total Ppt", obtained by concatenating ruled frame 604 (character string 610, "Total P") with ruled frame 605 (character string 611, "pt"), against the matching strings registered in the knowledge database 4. For simplicity, only the item value "Total Fat" belonging to the item name "Item A" in the knowledge database 4 is considered.
First, consider the case where the costs of characters close to the ruled frame are not weighted. Converting the concatenated string "Total Ppt" into "Total Fat" requires replacing the two characters "Pp". The length of the string "Total Fat" is 9, including the space. The likelihood is therefore 1 - (2/9) = 0.778.
Next, consider the case where, in the likelihood calculation, the weight of the conversion cost for characters close to a ruled line is set to 0.5. Here the characters "P" and "p" are subject to the cost weighting. Converting the concatenated string "Total Ppt" into "Total Fat" then gives a likelihood of 1 - (0.5 × 2/9) = 0.889.
As described above, the likelihood is 0.778 without cost weighting and 0.889 with it. Because the likelihood of the correct string is higher with weighting, a string belonging to another item name is less likely to be mistakenly adopted in its place. The accuracy of the ruled frame integration determination can thus be further improved.
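The two likelihoods in this example can be written out term by term. The subscripts are labels introduced here for readability only; each numerator is the sum of the conversion costs for the two replaced characters, and the denominator is the length of "Total Fat".

```latex
\begin{align*}
L_{\text{unweighted}} &= 1 - \frac{1 + 1}{9} = \frac{7}{9} \approx 0.778,\\
L_{\text{weighted}}   &= 1 - \frac{0.5 + 0.5}{9} = \frac{8}{9} \approx 0.889.
\end{align*}
```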
In the specific example above, the cost was given a smaller weight for characters close to vertical ruled lines, but the weighting is not limited to this. For example, the same processing can be applied to characters close to horizontal ruled lines, with the same effect.
As described above, the table recognition device of Embodiment 3 assigns a smaller weight to the character conversion cost of characters close to a ruled line when the ruled line frame integration determination unit calculates the likelihood.
This suppresses the influence of characters that are more likely than others to have been misrecognized, further improving the accuracy of the ruled frame integration determination.
In each of the embodiments described above, likelihood was used as an example of the degree of match between two character strings, but the degree of match is not limited to this. For example, the character strings may be represented as vectors, and the cosine similarity between the two string vectors may be used as the degree of match. When the cosine similarity is close to 1, the two string vectors are similar and the degree of match is high; when it is close to 0, the two string vectors are dissimilar and the degree of match is low.
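A cosine-similarity degree of match can be sketched as follows. The text does not say how a string is turned into a vector, so the character-bigram counts used here are purely an assumption for illustration, as are the function names.

```python
import math
from collections import Counter


def bigram_vector(text: str) -> Counter:
    """Represent a string as counts of its character bigrams (assumed encoding)."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the bigram-count vectors of a and b."""
    va, vb = bigram_vector(a), bigram_vector(b)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # strings shorter than two characters have no bigrams
    return dot / (norm_a * norm_b)


print(cosine_similarity("Total Ppt", "Total Fat"))  # partial overlap
print(cosine_similarity("Total Fat", "25g"))        # no shared bigrams
```

Under this encoding, strings sharing most of their bigrams score near 1 and strings sharing none score 0, which matches the interpretation given in the text.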
In each of the embodiments described above, the ruled frame integration determination is not limited to horizontally written or left-to-right languages. For example, the table recognition device of the above embodiments is also applicable to tables in which rows and columns are swapped, as in vertically written documents. It is likewise applicable to languages written from right to left, such as Arabic.
In addition to the above, any other configuration may be used as long as it provides similar functions and effects. Furthermore, within the scope of the present disclosure, any component of the embodiments may be modified or omitted.
1 table structure recognition unit, 2 character recognition unit, 3 ruled line frame integration determination unit, 4 knowledge database, 5 table structure knowledge database,
100 table recognition device, 101 processor, 102 memory, 103 external storage device, 104 input/output interface.
Claims (5)
1. A table recognition device that recognizes character strings written in a table-format document from image information of the table-format document, comprising:
a character recognition unit that recognizes the character strings written within each of a plurality of ruled frames provided in the table-format document; and
a ruled line frame integration determination unit that determines, from among a single string, which is the character string recognized for a target ruled frame among the plurality of ruled frames, and a concatenated string, which is obtained by concatenating the single string with a character string recognized for a ruled frame other than the target ruled frame,
whichever of the single string or the concatenated string has the higher degree of match with a matching string that should be written in the table-format document, as an integrated string that is a character string belonging to the target ruled frame.
2. The table recognition device according to claim 1, wherein, when the degree of match of the single string or the concatenated string determined as the integrated string is equal to or greater than a predetermined threshold, the ruled line frame integration determination unit replaces the integrated string with the matching string used to calculate the degree of match.
3. The table recognition device according to claim 1 or 2, wherein the ruled line frame integration determination unit calculates the degree of match for one or more matching strings that are selected, from among a plurality of matching strings defined for each ruled frame, by information restricting the matching strings that can be written within the ruled frame, and determines the integrated string.
4. The table recognition device according to any one of claims 1 to 3, wherein the ruled line frame integration determination unit calculates the degree of match with weighting that reduces the character conversion cost of characters close to the ruled lines of the plurality of ruled frames.
5. A table recognition method for recognizing character strings written in a table-format document from image information of the table-format document, wherein:
a character recognition unit recognizes the character strings written within each of a plurality of ruled frames provided in the table-format document; and
a ruled line frame integration determination unit determines, from among a single string, which is the character string recognized for a target ruled frame among the plurality of ruled frames, and a concatenated string, which is obtained by concatenating the single string with a character string recognized for a ruled frame other than the target ruled frame,
whichever of the single string or the concatenated string has the higher degree of match with a matching string that should be written in the table-format document, as an integrated string that is a character string belonging to the target ruled frame.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2022/038526 WO2024084539A1 (en) | 2022-10-17 | 2022-10-17 | Table recognition device and method |
JP2024523957A JP7563655B2 (en) | 2022-10-17 | 2022-10-17 | Table recognition device and method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2022/038526 WO2024084539A1 (en) | 2022-10-17 | 2022-10-17 | Table recognition device and method |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024084539A1 (en) | 2024-04-25 |
Family
ID=90737311
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2022/038526 WO2024084539A1 (en) | 2022-10-17 | 2022-10-17 | Table recognition device and method |
Country Status (2)
Country | Link |
---|---|
JP (1) | JP7563655B2 (en) |
WO (1) | WO2024084539A1 (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2014068770A1 (en) * | 2012-11-02 | 2014-05-08 | 株式会社日立製作所 | Data extraction method, data extraction device, and program thereof |
-
2022
- 2022-10-17 JP JP2024523957A patent/JP7563655B2/en active Active
- 2022-10-17 WO PCT/JP2022/038526 patent/WO2024084539A1/en unknown
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2014068770A1 (en) * | 2012-11-02 | 2014-05-08 | 株式会社日立製作所 | Data extraction method, data extraction device, and program thereof |
Also Published As
Publication number | Publication date |
---|---|
JP7563655B2 (en) | 2024-10-08 |
JPWO2024084539A1 (en) | 2024-04-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111737969B (en) | Resume parsing method and system based on deep learning | |
WO2021135444A1 (en) | Text error correction method and apparatus based on artificial intelligence, computer device and storage medium | |
JPH058464B2 (en) | ||
KR20230061001A (en) | Apparatus and method for correcting text | |
Chang | A new approach for automatic Chinese spelling correction | |
US20230177266A1 (en) | Sentence extracting device and sentence extracting method | |
CN113779970A (en) | Text error correction method and related equipment thereof | |
CN110532569B (en) | Data collision method and system based on Chinese word segmentation | |
JPH0634256B2 (en) | Contact character cutting method | |
CN113673294B (en) | Method, device, computer equipment and storage medium for extracting document key information | |
US7130487B1 (en) | Searching method, searching device, and recorded medium | |
WO2024084539A1 (en) | Table recognition device and method | |
Mohapatra et al. | Spell checker for OCR | |
US11183191B2 (en) | Information processing apparatus and non-transitory computer readable medium | |
CN113362026A (en) | Text processing method and device | |
JP6303508B2 (en) | Document analysis apparatus, document analysis system, document analysis method, and program | |
JP4101345B2 (en) | Character recognition device | |
CN115410207B (en) | Detection method and device for vertical text | |
JP5252487B2 (en) | Information processing apparatus, control method thereof, control program, and recording medium | |
CN117454893B (en) | Python-based intelligent word segmentation method, system, equipment and storage medium | |
Vinitha | Error detection and correction in Indic OCRs | |
JP5289032B2 (en) | Document search device | |
JP2009086911A (en) | Unique expression extraction device, its method, program, and recording medium | |
JP3033554B2 (en) | Character recognition device | |
JP2002014981A (en) | Document filing device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| ENP | Entry into the national phase | Ref document number: 2024523957; Country of ref document: JP; Kind code of ref document: A |
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22962651; Country of ref document: EP; Kind code of ref document: A1 |