WO2024202018A1 - Value extraction system, value extraction method, and program

Value extraction system, value extraction method, and program

Info

Publication number
WO2024202018A1
Authority
WO
WIPO (PCT)
Prior art keywords
estimated
training
value
key
value extraction
Prior art date
Legal status
Pending
Application number
PCT/JP2023/013598
Other languages
English (en)
French (fr)
Japanese (ja)
Inventor
永男 蔡
美廷 金
Current Assignee
Rakuten Group Inc
Original Assignee
Rakuten Group Inc
Priority date
Filing date
Publication date
Application filed by Rakuten Group Inc filed Critical Rakuten Group Inc
Priority to JP2025509603A (JPWO2024202018A1)
Priority to PCT/JP2023/013598 (WO2024202018A1)
Publication of WO2024202018A1

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/192 Recognition using electronic means using simultaneous comparisons or correlations of the image signals with a plurality of references
    • G06V30/194 References adjustable by an adaptive method, e.g. learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/412 Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables

Definitions

  • This disclosure relates to a value extraction system, a value extraction method, and a program.
  • Non-Patent Documents 1 to 4 describe techniques for analyzing the layout of a document based on a learning model that has learned the layouts of various documents and the coordinates of cells (bounding boxes) containing the components of a document shown in an image.
  • the layout of a document is analyzed based on the coordinates of each cell in the entire image.
  • One of the objectives of this disclosure is to accurately extract the value corresponding to a key.
  • the value extraction system includes an estimated image acquisition unit that acquires an estimated image showing an estimated document including an estimated key and an estimated value, a designated key acquisition unit that acquires a designated key designated by a user, and an extraction unit that extracts the estimated value from the estimated image based on the estimated image, the designated key, and a value extraction model that has learned the relative positional relationship between the training key and the training value included in the training document shown in the training image.
  • FIG. 1 is a diagram illustrating an example of an overall configuration of a value extraction system.
  • FIG. 2 is a diagram showing an example of a state in which a user photographs an estimated document.
  • FIG. 3 is a diagram showing an example of an estimated image obtained by executing character recognition.
  • FIG. 4 is a diagram illustrating an example of the relationship between input and output of a trained value extraction model.
  • FIG. 5 is a diagram illustrating an example of functions realized by the value extraction system.
  • FIG. 6 is a diagram illustrating an example of a training database.
  • FIG. 7 is a diagram illustrating an example of a process executed in the value extraction system.
  • FIG. 8 is a diagram illustrating an example of functions realized in a value extraction system according to a modified example.
  • FIG. 9 is a diagram showing an example of a training database according to the first modified example.
  • Fig. 1 is a diagram showing an example of the overall configuration of a value extraction system.
  • the value extraction system 1 includes a server 10 and a user terminal 20.
  • Each of the server 10 and the user terminal 20 can be connected to a network N such as the Internet or a LAN.
  • the server 10 is a server computer.
  • the control unit 11 includes at least one processor.
  • the storage unit 12 includes a volatile memory such as RAM and a non-volatile memory such as a flash memory.
  • the communication unit 13 includes at least one of a communication interface for wired communication and a communication interface for wireless communication.
  • the user terminal 20 is a user's computer.
  • the user terminal 20 is a personal computer, a tablet terminal, a smartphone, or a wearable terminal.
  • the physical configurations of the control unit 21, the storage unit 22, and the communication unit 23 are similar to those of the control unit 11, the storage unit 12, and the communication unit 13, respectively.
  • the operation unit 24 is an input device such as a touch panel or a mouse.
  • the display unit 25 is a liquid crystal display or an organic EL display.
  • the photographing unit 26 includes at least one camera.
  • the programs stored in the storage units 12, 22 may be supplied via the network N.
  • Each of the server 10 and the user terminal 20 may also include at least one of a reading unit (e.g., a memory card slot) that reads a computer-readable information storage medium, and an input/output unit (e.g., a USB port) for inputting and outputting data to and from an external device.
  • a program stored in an information storage medium may be supplied via at least one of the reading unit and the input/output unit.
  • the value extraction system 1 is not limited to the example of FIG. 1 as long as it includes at least one computer.
  • the value extraction system 1 may include only the server 10 without including the user terminal 20.
  • the user terminal 20 exists outside the value extraction system 1.
  • the value extraction system 1 may include another computer other than the server 10, and the processing described in this embodiment may be executed by the other computer.
  • the other computer is a personal computer, a tablet terminal, or a smartphone.
  • the value extraction model is a model that extracts a value corresponding to a key from an image showing a document containing the key and the value.
  • the value extraction model is a model that utilizes a machine learning technique.
  • the value extraction model is a Vision Transformer-based model, but the machine learning technique itself can utilize various techniques used in the field of image processing.
  • the value extraction model may be a model that utilizes a neural network or a support vector machine.
  • a key is information that indicates the meaning of a value.
  • a key can also be called an attribute, explanation, or heading of a value.
  • a value is information that indicates a specific value of a key.
  • a value can also be called details or content of a key.
  • a key corresponds to an attribute value.
  • the key and the value are each a character string (text), but each of the key and the value may be in any format and is not limited to a character string.
  • at least one of the key and the value may be a symbol string that is not classified as a character string, a barcode, a two-dimensional code, or an icon.
  • the format of the key and the format of the value may be different, such as when the key is a character string and the value is a two-dimensional code.
  • a document is any material that contains information that a human can see and understand.
  • a document is a piece of paper or a card with characters written on it.
  • a receipt is used as an example of the document, but the document may be of any type and is not limited to a receipt.
  • the document may be an invoice, estimate, application form, official document, internal company document, flyer, paper, magazine, newspaper, reference book, or identity document.
  • the value extraction model is capable of extracting value from an image showing any document.
  • the value extraction model may be trained on multiple types of documents, not just one type of document.
  • a document includes at least one key and at least one value.
  • a document includes one key and one value, but a document may include one key and multiple values.
  • a document may include multiple keys and one value.
  • a document may include multiple keys and multiple values. That is, the relationship between keys and values may be one-to-one, one-to-many, many-to-one, or many-to-many.
  • a document may include information other than keys and values.
  • an image to be trained by the value extraction model is referred to as a training image.
  • a document shown in a training image is referred to as a training document.
  • a key and a value contained in a training document are referred to as a training key and a training value, respectively.
  • An image to be estimated by a trained value extraction model is referred to as an estimated image.
  • a document shown in an estimated image is referred to as an estimated document.
  • a key and a value contained in an estimated document are referred to as an estimated key and an estimated value, respectively.
  • a case where a user holds an estimated document is taken as an example. For example, a user photographs an estimated document with the photographing unit 26.
  • FIG. 2 is a diagram showing an example of a user photographing an estimated document.
  • when a user photographs the estimated document ED with the photographing unit 26, the user terminal 20 generates an estimated image EI showing the estimated document ED.
  • the x-axis and y-axis are set with the upper left corner of the estimated image EI as the origin O.
  • Positions within the estimated image EI are shown in two-dimensional coordinates including x-coordinates and y-coordinates.
  • Positions within the estimated image EI can be expressed in any coordinate system and are not limited to the example of FIG. 2.
  • positions within the estimated image EI may be expressed in a coordinate system with the center of the estimated image EI as the origin O, or in a polar coordinate system.
  • the character string "Payment” printed on the estimated document ED corresponds to the estimated key
  • the character string indicating the payment method printed on the estimated document ED (“EEE-pay” in the example of Figure 2) corresponds to the estimated value.
  • the estimated key and estimated value may be any information that pairs with each other, and are not limited to the example of this embodiment.
  • the character string "Total” printed on the estimated document ED may correspond to the estimated key
  • the total amount printed on the estimated document ED may correspond to the estimated value
  • the character string "Tel” printed on the estimated document ED may correspond to the estimated key
  • the telephone number printed on the estimated document ED may correspond to the estimated value.
  • the user specifies a key that is the same as the estimated key from the operation unit 24.
  • the key specified by the user is referred to as the designated key.
  • the user specifies the designated key by inputting an arbitrary character string, or by selecting the designated key from multiple candidates.
  • the user terminal 20 transmits the designated key and the estimated image EI to the server 10.
  • when the server 10 receives the designated key and the estimated image EI from the user terminal 20, it performs character recognition on the estimated image EI.
  • FIG. 3 is a diagram showing an example of an estimated image EI on which character recognition has been performed.
  • the server 10 uses a known character recognition tool to detect estimated cells EC1 to EC15 from within the estimated image EI.
  • estimated cells EC may have any shape and are not limited to a rectangle as shown in FIG. 3.
  • the estimated cells EC may be a square, a rectangle with rounded corners, a polygon other than a rectangle, or an ellipse.
  • the estimated cell EC is an area that includes at least one character.
  • the estimated cell EC is sometimes called a bounding box.
  • the estimated cell EC is detected using a character recognition tool, and therefore includes at least one character.
  • the estimated cell EC may be detected for each character, but in this embodiment, multiple consecutive characters are detected as one estimated cell EC. For example, if the space between characters is small, one estimated cell EC that includes multiple words separated by spaces may be detected.
  • the server 10 identifies the position of the estimated cell EC by a matrix rather than by coordinates. For example, the server 10 classifies estimated cells EC whose coordinates are close to each other into the same row or column. This allows the server 10 to absorb slight differences in coordinates between estimated cells EC that are actually in the same row or column.
  • estimated cell EC1 is in the 0th row and 1st column.
  • Estimated cell EC2 is in the 1st row and 0th column.
  • Estimated cell EC3 is in the 2nd row and 0th column.
  • the positions of estimated cells EC4 to EC15 are identified by a matrix.
  • the server 10 acquires estimated cell information for each of the multiple estimated cells EC.
  • the estimated cell information may be any information related to the estimated cell EC.
  • the estimated cell information includes at least one of the matrix position of the estimated cell EC, the character string contained in the estimated cell EC, the coordinates of the estimated cell EC, the width of the estimated cell EC, and image data within the estimated cell EC.
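  • As an illustration (not part of the disclosure), the following sketch shows one way the estimated cell information described above might be represented, and how cells with nearby coordinates could be grouped into the same row or column to absorb slight coordinate differences. The class name, function name, and pixel tolerance are assumptions introduced only for this example.
```python
from dataclasses import dataclass

@dataclass
class EstimatedCell:
    """One bounding box detected by the character recognition tool."""
    text: str        # character string contained in the cell
    x: float         # x coordinate of the cell (e.g., top-left corner)
    y: float         # y coordinate of the cell
    width: float     # width of the cell
    row: int = -1    # matrix row, filled in later
    col: int = -1    # matrix column, filled in later

def assign_rows_and_columns(cells, tol=10.0):
    """Group cells whose y (x) coordinates are within `tol` pixels into the
    same row (column), numbering rows and columns from the top left."""
    def cluster(values):
        # Sort the coordinates and start a new cluster whenever the gap exceeds tol.
        order = sorted(set(values))
        clusters, current = [], [order[0]]
        for v in order[1:]:
            if v - current[-1] <= tol:
                current.append(v)
            else:
                clusters.append(current)
                current = [v]
        clusters.append(current)
        # Map every coordinate to the index of its cluster (row or column number).
        return {v: i for i, group in enumerate(clusters) for v in group}

    row_of = cluster([c.y for c in cells])
    col_of = cluster([c.x for c in cells])
    for c in cells:
        c.row, c.col = row_of[c.y], col_of[c.x]
    return cells
```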
  • the server 10 inputs the estimated cell information for each of the multiple estimated cells EC and the specified key to the trained value extraction model.
  • Figure 4 is a diagram showing an example of the input and output relationship of a trained value extraction model.
  • the value extraction model M identifies, from the plurality of estimated cells EC, the estimated cell EC whose estimated cell information includes the estimated key, that is, a character string that matches the specified key.
  • the value extraction model M identifies that the estimated key is included in estimated cell EC14.
  • the value extraction model M estimates that the estimated value is included in estimated cell EC15, which is in a predetermined positional relationship with estimated cell EC14.
  • For example, if the value extraction model M has learned that the training value is in the same row as the training key and to the right of the training key, the value extraction model M estimates that the estimated value is contained in estimated cell EC15, which is in the same row as estimated cell EC14 and to the right of estimated cell EC14. Therefore, the value extraction model M outputs the character string "EEE-Pay" contained in estimated cell EC15 as the estimated value.
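  • To make this relative positional relationship concrete, the minimal sketch below finds the cell containing the specified key and returns the text of the cell at a given row/column offset (here, same row, one column to the right). In the actual system the trained value extraction model M estimates this relationship rather than having it hard-coded; the function name and offsets are illustrative assumptions.
```python
def extract_value_by_offset(cells, designated_key, row_offset=0, col_offset=1):
    """Return the text of the cell at the learned offset from the cell whose
    text matches the designated key (e.g., "Payment" -> "EEE-Pay")."""
    key_cell = next((c for c in cells if c.text == designated_key), None)
    if key_cell is None:
        return None
    for c in cells:
        if c.row == key_cell.row + row_offset and c.col == key_cell.col + col_offset:
            return c.text
    return None
```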
  • the server 10 transmits the estimated value output from the value extraction model M to the user terminal 20.
  • it is assumed that various training documents, and the relative positional relationships between various training keys and training values, have been learned by the value extraction model M.
  • the relative positional relationships between training keys and training values contained in various training documents, not only receipts but also invoices and estimates, have been learned by the value extraction model M. Since the relative positional relationships between various training keys and training values have been learned, when the user specifies a designated key, the value extraction model M can estimate the position of the estimated value corresponding to the estimated key that is the same as the specified key.
  • the value extraction model M has learned the relative positional relationship between the training key and the training value.
  • the value extraction system 1 can extract an estimated value corresponding to the estimated key with high accuracy by extracting an estimated value from the estimated image EI based on the estimated image EI, the specified key, and the learned value extraction model M. The details of this embodiment will be described below.
  • FIG. 5 is a diagram showing an example of functions realized by the value extraction system 1.
  • As shown in FIG. 5, the server 10 includes a data storage unit 100, a training data generation unit 101, a learning unit 102, an estimated image acquisition unit 103, a designated key acquisition unit 104, and an extraction unit 105.
  • the data storage unit 100 is realized by the storage unit 12 shown in Fig. 1.
  • the training data generation unit 101, the learning unit 102, the estimated image acquisition unit 103, the designated key acquisition unit 104, and the extraction unit 105 are realized by the control unit 11 shown in Fig. 1.
  • the data storage unit 100 stores data necessary for at least one of learning the value extraction model M and estimation based on the learned value extraction model M.
  • the data storage unit 100 stores a training database DB in which training data to be learned by the value extraction model M is stored.
  • FIG. 6 is a diagram showing an example of a training database DB. "No.” is identification information of the training data.
  • the training data includes an input portion that is input to the value extraction model M during learning, and an output portion that should be output from the value extraction model M during learning.
  • the format of the input portion of the training data is the same as the format of the input data that is input to the value extraction model M during estimation.
  • the format of the output portion of the training data is the same as the format of the output data that is output from the value extraction model M during estimation.
  • the input portion of the training data includes information regarding the positions of the training keys.
  • the output portion of the training data includes information regarding the positions of the training values. Since the training data includes these input and output portions, the relative positional relationship between the training keys and the training values is indicated in the training data.
  • the position is represented by a matrix, but the position may be represented in other formats such as coordinates or a vector.
  • the input portion of the training data includes training cell information about training cells extracted from the training image.
  • Training cells differ from estimated cells in that they are extracted from the training image, but are similar to estimated cells in other respects. For this reason, the description of an estimated cell with the word "estimated” replaced with "training” corresponds to the description of the training cell.
  • the training cell information includes at least one of the matrix position of the training cell, the character string contained in the training cell, the coordinates of the training cell, the width of the training cell, and image data within the training cell.
  • a training cell that contains a training key is referred to as a training key cell. Since other training cells that contain character strings other than the training key are also extracted from the training image, when there is no need to distinguish between other training cells and training key cells, they are simply referred to as training cells.
  • the example in Figure 6 shows a case where the input portion of the training data contains only a training key cell, but the input portion of the training data may also contain training cell information of other training cells.
  • the output portion of the training data includes training cell information related to the training values extracted from the training images.
  • a training cell that includes a training value is referred to as a training value cell. Since other training cells that include character strings other than the training value are also extracted from the training images, when there is no need to distinguish between other training cells and training value cells, they are simply referred to as training cells.
  • the example in Figure 6 shows a case where the output portion of the training data includes only training value cells, but the output portion of the training data may also include other training cells.
  • the matrix indicating the position of the training value cell is based on the position of the training key cell.
  • the matrix of the training value cell, with the matrix of the training key cell taken as (0,0), is shown in the output portion of the training data.
  • the input portion of the first training data contains a matrix (0,0) indicating the position of the training key "Payment”.
  • the training value of the first training data is in the same row as the training key "Payment” and is located one column away to the right, so the output portion of the first training data contains a matrix (0,1) indicating the position of this training value.
  • the training data may indicate the absolute position of a matrix in the training image, rather than a matrix based on the position of the training key cell.
  • the training data may indicate the relative positional relationship between the training key cell and the training value cell.
  • the training data may include coordinates indicating the position of the training key cell and coordinates indicating the position of the training value. If the training cell is not used, the training data may include coordinates or a matrix indicating the position of the training key and coordinates or a matrix indicating the position of the training value.
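  • For illustration only, the first training data entry described above (training key "Payment" at matrix position (0,0), training value one column to the right at (0,1)) could be represented as follows; the field names are assumptions and not defined in this disclosure.
```python
training_data_example = {
    "input": {
        "training_key": "Payment",    # character string of the training key
        "key_cell_matrix": (0, 0),    # position of the training key cell (reference point)
    },
    "output": {
        "value_cell_matrix": (0, 1),  # training value cell: same row, one column to the right
    },
}
```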
  • the data storage unit 100 stores a value extraction model M before learning.
  • the value extraction model M includes a program and parameters. The parameters are adjusted by learning.
  • the value extraction model M before learning is a value extraction model M with parameters at initial values.
  • the value extraction model M is a Vision Transformer-based model.
  • the Vision Transformer is a method that applies the Transformer, which is mainly used in natural language processing, to image processing.
  • Vision Transformer analyzes the connections between the components of a document in input data arranged as a sequence. For example, Vision Transformer divides an input image into multiple patches and obtains input data in which the multiple patches are arranged in sequence. Vision Transformer applies the context analysis performed by the Transformer to analyze the connections between patches. Vision Transformer converts each patch included in the input data into a vector and analyzes it. The value extraction model M of this embodiment applies this Vision Transformer mechanism.
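  • A minimal sketch of the patch-splitting step that the Vision Transformer mechanism relies on: the image is divided into fixed-size patches, each patch is flattened into a vector, and the resulting sequence is what the Transformer-style attention analyzes. The patch size and array shapes are illustrative assumptions, not values taken from this disclosure.
```python
import numpy as np

def split_into_patches(image, patch_size=16):
    """Divide an (H, W, C) image into non-overlapping patches and flatten each
    patch into a vector, yielding a (num_patches, patch_size*patch_size*C) sequence."""
    h, w, c = image.shape
    h, w = h - h % patch_size, w - w % patch_size  # drop edge pixels that do not fill a patch
    patches = (image[:h, :w]
               .reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
               .transpose(0, 2, 1, 3, 4)
               .reshape(-1, patch_size * patch_size * c))
    return patches

# Example: a 224x224 RGB image becomes a sequence of 196 patch vectors of length 768.
sequence = split_into_patches(np.zeros((224, 224, 3)), patch_size=16)
```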
  • the data stored in the data storage unit 100 is not limited to the above examples.
  • the data storage unit 100 only needs to store data necessary for at least one of the learning of the value extraction model M and the estimation based on the learned value extraction model M, and can store any data.
  • the data storage unit 100 may store a program for executing the learning of the value extraction model M, an estimated image database in which estimated images EI are stored, a training image database in which training images are stored, and a character recognition tool.
  • the training data generating unit 101 generates training data.
  • the training data generating unit 101 generates training data from a training image, but the creator who creates the value extraction model M may manually generate the training data. For example, the creator visually checks the training image and annotates the positions of the training keys and the positions of the training values.
  • the training data generating unit 101 may generate training data based on the annotation results by the creator.
  • the training data generation unit 101 generates training data including training key cell information regarding the position of a training key cell including a training key, and training value cell information regarding the position of a training value cell including a training value, based on character recognition of the training image.
  • the training data generation unit 101 performs character recognition on the training image similar to that of the estimated image EI described in FIG. 3.
  • the training data generation unit 101 identifies a training key cell and a training value cell from among multiple training cells extracted from the training image. It is assumed that the training key and training value are specified in advance by the creator.
  • the training data generation unit 101 identifies, from among multiple training cells, a training cell that contains a character string that matches the training key as a training key cell.
  • the training data generation unit 101 identifies, from among multiple training cells, a training cell that contains a character string that matches the training value as a training value cell.
  • the character string match may be a perfect match or a partial match.
  • the creator may visually check the training cell and annotate the training key cell and training value cell. In this case, the training data generation unit 101 identifies the training key cell and training value cell based on the creator's annotation results.
  • the training data generation unit 101 calculates the matrix position of the training value cell based on the matrix position of the training key cell.
  • the training data generation unit 101 generates a pair of (0,0) indicating the matrix position of the training key cell and a character string indicating the training key as the input part of the training data.
  • the training data generation unit 101 generates the matrix position of the training value cell based on the matrix position of the training key cell as the output part of the training data.
  • the training data generation unit 101 generates training data by combining the generated input part and output part.
  • the training data generation unit 101 performs similar processing on other training images to generate training data one after another.
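  • A sketch of the generation flow performed by the training data generation unit 101 as described above: character recognition yields training cells, the cells matching the specified training key and training value are identified, and the matrix position of the training value cell relative to the training key cell is recorded. String matching is simplified to an exact comparison here, and all names are assumptions introduced for this example.
```python
def generate_training_data(training_cells, training_key, training_value):
    """Build one training example from the cells recognized in a training image.
    `training_cells` carry text/row/col as in the EstimatedCell sketch above."""
    key_cell = next((c for c in training_cells if c.text == training_key), None)
    value_cell = next((c for c in training_cells if c.text == training_value), None)
    if key_cell is None or value_cell is None:
        return None  # annotation and recognition did not line up; skip this image
    return {
        "input": {
            "training_key": training_key,
            "key_cell_matrix": (0, 0),  # the key cell is taken as the reference point
        },
        "output": {
            # position of the training value cell relative to the training key cell
            "value_cell_matrix": (value_cell.row - key_cell.row,
                                  value_cell.col - key_cell.col),
        },
    }
```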
  • the learning unit 102 executes learning of the value extraction model M based on the training data.
  • the learning itself can utilize various techniques used in machine learning techniques.
  • the learning unit 102 may execute learning of the value extraction model M based on the backpropagation method or the gradient descent method.
  • the learning unit 102 adjusts the parameters of the value extraction model M so that an output portion of the training data is output when an input portion of the training data is input.
  • the learning unit 102 executes learning of the value extraction model M until the loss calculated based on the loss function becomes sufficiently small.
  • the learning unit 102 records the learned value extraction model M in the data storage unit 100.
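  • A minimal sketch of the parameter adjustment performed by the learning unit 102: the input portion of each training example is fed to the model, the loss between the model output and the output portion is computed, and the parameters are updated by backpropagation and gradient descent until the loss becomes sufficiently small. The model object, optimizer, loss function, and encoding of the examples are all illustrative assumptions (PyTorch-style pseudocode), not the actual implementation.
```python
import torch

def train_value_extraction_model(model, examples, epochs=10, lr=1e-4, loss_threshold=1e-3):
    """Adjust the parameters of `model` so that the output portion of each
    training example is produced when its input portion is given."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(epochs):
        total = 0.0
        for inputs, target in examples:          # pre-encoded tensors for each example
            optimizer.zero_grad()
            prediction = model(inputs)           # forward pass
            loss = loss_fn(prediction, target)   # compare with the output portion
            loss.backward()                      # backpropagation
            optimizer.step()                     # gradient-based parameter update
            total += loss.item()
        if total / len(examples) < loss_threshold:
            break                                # loss is sufficiently small
    return model
```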
  • the estimated image acquisition unit 103 acquires an estimated image EI showing an estimated document ED including an estimated key and an estimated value.
  • the estimated image acquisition unit 103 acquires the estimated image EI from the user terminal 20.
  • the estimated image EI may be recorded in advance in the data storage unit 100 or an external storage medium. In this case, the estimated image acquisition unit 103 may acquire the estimated image EI from the data storage unit 100 or the external storage medium.
  • the designated key acquisition unit 104 acquires a designated key designated by a user.
  • the designated key acquisition unit 104 acquires the designated key from the user terminal 20.
  • the designated key may be recorded in advance in the data storage unit 100 or an external storage medium. In this case, the designated key acquisition unit 104 may acquire the designated key from the data storage unit 100 or the external storage medium.
  • the extraction unit 105 extracts an estimated value from the estimated image EI based on the estimated image EI, the designated key, and a value extraction model M that has learned the relative positional relationship between the training key and the training value contained in the training document shown in the training image.
  • the extraction unit 105 acquires input data to be input to the value extraction model M based on the estimated image EI and the designated key.
  • the extraction unit 105 inputs the input data to the value extraction model M.
  • the value extraction model M calculates the feature amount of the input data based on the parameters adjusted by learning.
  • the value extraction model M outputs an estimation result of the estimated value based on the feature amount of the input data.
  • the extraction unit 105 extracts an estimated value from the estimated image EI by acquiring the estimated value output from the value extraction model M.
  • the value extraction model M has learned the relative positional relationship between the rows and columns of the training keys in the training image and the rows and columns of the training values in the training image. For example, the row and column positions of the training values based on the row and column positions of the training keys have been learned by the value extraction model M.
  • the value extraction model M may have learned the relationship between the row and column positions of the training keys and the row and column positions of the training values based on the origin of the training image, rather than based on the row and column positions of the training keys.
  • the extraction unit 105 obtains the row and column of the specified key based on the results of performing character recognition on the estimated image EI.
  • the extraction unit 105 may perform character recognition on the estimated image EI based on a method other than optical character recognition.
  • the extraction unit 105 may perform character recognition based on template matching that compares with a template image, a method that analyzes the distribution of pixel values as a histogram, a machine learning method such as a neural network, or other methods.
  • the character recognition by the training data generation unit 101 may be various methods including optical character recognition.
  • the extraction unit 105 extracts estimated cells, which are cells that contain at least one character, from the estimated image EI. Based on the coordinates of each of the multiple estimated cells, the extraction unit 105 acquires the rows and columns of each of the multiple estimated cells such that estimated cells with close x coordinates belong to the same column and estimated cells with close y coordinates belong to the same row. Acquiring rows and columns means identifying the positions of each of the rows and columns. For example, when row numbers and column numbers are assigned in order from the top left of the estimated image EI, the extraction unit 105 acquires the row numbers and column numbers of each of the multiple estimated cells.
  • the extraction unit 105 extracts an estimated value based on the row and column of the specified key and the value extraction model M.
  • an example is given of a case where the process of identifying the row and column of the specified key is executed inside the value extraction model M, but the process of identifying the row and column of the specified key may be executed outside the value extraction model M.
  • the extraction unit 105 may identify the row and column of the specified key and then input the identified row and column of the specified key to the value extraction model M. That is, the extraction unit 105 may identify the row and column of the specified key as a preprocessing of estimation by the value extraction model M.
  • the extraction unit 105 inputs each of the multiple pieces of estimated cell information extracted from the estimated image EI and the designated key to the value extraction model M.
  • the value extraction model M identifies, from the multiple pieces of estimated cell information, estimated cell information that includes a character string that is the same as or similar to the designated key.
  • the value extraction model M identifies the row and column indicated by the identified estimated cell information as the row and column of the designated key.
  • the value extraction model M estimates and outputs the row and column of the estimated value based on the row and column of the identified designated key.
  • the extraction unit 105 extracts an estimated value from the estimated image EI by obtaining the row and column of the estimated value output from the value extraction model M.
  • the value extraction model M has learned the relative positional relationship between training key cell information relating to the position of a training key cell that includes a training key, and training value cell information relating to the position of a training value cell that includes a training value.
  • the extraction unit 105 acquires estimated key cell information relating to the position of an estimated key cell that includes an estimated key, based on character recognition of the estimated image EI.
  • the extraction unit 105 extracts an estimated value based on the estimated key cell information and the value extraction model M.
  • the extraction unit 105 extracts an estimated value based on the value extraction model M learned by the learning unit 102.
  • the learning of the value extraction model M may be performed by a computer other than the server 10.
  • the data storage unit 100 stores the value extraction model M learned by the other computer.
  • the extraction unit 105 may extract an estimated value based on the value extraction model M learned by the other computer.
  • the value extraction model M learns the relative positional relationship between one training key and one training value corresponding to that one training key. That is, the training key and the training value have a one-to-one correspondence.
  • the estimated key and the estimated value also have a one-to-one correspondence.
  • the extraction unit 105 extracts one estimated value corresponding to one specified key from the estimated image EI.
  • the process executed by the extraction unit 105 is not limited to the above example.
  • the format of the input data input to the value extraction model M and the output data output from the value extraction model M may be any predetermined format and is not limited to the example of this embodiment.
  • the extraction unit 105 may input a pair of the estimated image EI and the designated key directly to the value extraction model M as input data.
  • the value extraction model M performs convolution of the estimated image EI or the like to calculate the feature amount of the estimated image EI and executes an estimation according to the feature amount and the designated key.
  • the value extraction model M may output a character string of the estimated value instead of outputting estimated cell information related to the position of the estimated value cell.
  • the data storage unit 200 is realized mainly by the storage unit 22.
  • the transmission unit 201 and the reception unit 202 are realized mainly by the control unit 21.
  • the data storage unit 200 stores data necessary for acquiring the estimated image EI.
  • the data storage unit 200 stores the estimated image EI generated by the photographing unit 26.
  • the transmission unit 201 transmits various data to the server 10. For example, the transmission unit 201 transmits an estimated image EI to the server 10.
  • the receiving unit 202 receives various data from the server 10. For example, the receiving unit 202 receives from the server 10 an estimated value estimated by the value extraction model M. The user terminal 20 causes the display unit 25 to display the estimated value.
  • Fig. 7 is a diagram showing an example of a process executed by the value extraction system 1.
  • the control units 11 and 21 execute programs stored in the storage units 12 and 22, respectively, to execute the process of Fig. 7.
  • the server 10 executes character recognition on a training image to extract a training key cell and a training value cell, and generates training data including training key cell information and training value cell information (S1).
  • the server 10 executes learning of the value extraction model M based on the training data stored in the training database DB (S2).
  • When the user photographs the estimated document ED with the photographing unit 26, the user terminal 20 generates an estimated image EI and transmits it to the server 10 (S3).
  • the server 10 receives the estimated image EI from the user terminal 20 (S4).
  • the user terminal 20 accepts the user's designation of a designated key and transmits it to the server 10 (S5).
  • the server 10 receives the designated key from the user terminal 20 (S6).
  • the user terminal 20 may transmit the estimated image EI and the designated key to the server 10 all at once.
  • the server 10 performs character recognition on the estimated image EI to extract the estimated cells EC (S7).
  • the server 10 acquires, for each estimated cell EC, the estimated cell information other than the row number and column number.
  • the server 10 acquires the cell information of each of the multiple estimated cells EC by assigning the same row number to the estimated cells EC that belong to the same row based on the y coordinate of each of the multiple estimated cells EC, and assigning the same column number to the estimated cells EC that belong to the same column based on the x coordinate of each of the multiple estimated cells EC.
  • the server 10 inputs the estimated cell information of each of the multiple estimated cells EC and the designation key to the value extraction model M (S8).
  • the value extraction model M calculates the features of the input data input to itself and outputs an estimation result according to the features.
  • the server 10 extracts an estimated value from the estimated image EI based on the output from the value extraction model M (S9).
  • the server 10 transmits the estimated value to the user terminal 20 (S10).
  • the user terminal 20 receives the estimated value from the server 10 (S11), and this process ends.
  • the value extraction system 1 of this embodiment extracts an estimated value from the estimated image EI based on the estimated image EI, the designated key, and the learned value extraction model M. This allows the estimated value corresponding to the estimated key to be extracted with high accuracy. For example, as in Non-Patent Documents 1 to 4, when the model is made to learn the position of the estimated value in the entire image, it can only handle documents with a certain specific layout, but if the relative positional relationship between the estimated key and the estimated value is similar to the relative positional relationship between the training key and the training value, the value extraction system 1 can extract the estimated value even if the estimated document ED is unknown. Therefore, the value extraction system 1 can handle various estimated values of various estimated documents ED, thereby improving the versatility of the value extraction model M.
  • the value extraction system 1 also obtains the row and column of the specified key based on the results of executing character recognition on the estimated image EI, and extracts an estimated value based on the row and column of the specified key and the value extraction model M. This allows the value extraction system 1 to extract an estimated value after absorbing slight deviations in the coordinates indicating the position of the estimated cell EC, thereby further improving the accuracy of extracting the estimated value.
  • the value extraction system 1 also acquires estimated key cell information relating to the position of an estimated key cell including an estimated key based on character recognition of the estimated image EI.
  • the value extraction system 1 extracts an estimated value based on the estimated key cell information and the value extraction model M. This allows the value extraction system 1 to use character recognition to create a state in which the value extraction model M can easily identify the tendency of the estimated image EI, thereby further improving the accuracy of extraction of the estimated value.
  • the value extraction system 1 also generates training data including training key cell information and training value cell information based on character recognition of the training image.
  • the value extraction system 1 executes learning of the value extraction model M based on the training data.
  • the value extraction system 1 extracts estimated values based on the learned value extraction model M. This allows the value extraction system 1 to automate the process of generating training data, thereby reducing the effort required for the creator of the value extraction model M to prepare training data.
  • the value extraction model M also learns the relative positional relationship between one training key and one training value corresponding to that one training key.
  • the value extraction system 1 extracts one estimated value corresponding to one specified key from the estimated image EI. This allows the value extraction system 1 to extract an estimated value from the estimated document ED in which the estimated key and the estimated value correspond one-to-one.
  • FIG. 8 is a diagram showing an example of functions realized by the modified value extraction system 1.
  • the modified server 10 includes an additional generation unit 106, an additional learning unit 107, a language identification unit 108, and a feature identification unit 109.
  • the additional generation unit 106, the additional learning unit 107, the language identification unit 108, and the feature identification unit 109 are realized by the control unit 11.
  • the position of the training value relative to the training key is learned by the value extraction model M.
  • the training value is not necessarily located in one direction relative to the training key.
  • the value extraction model M may learn whether or not a training value exists in each of a plurality of directions, relative to the position of the training key in the training image.
  • FIG. 9 is a diagram showing an example of a training database DB of variant example 1.
  • the training data indicates values indicating whether or not a training value exists in each of eight directions based on the training key: top left, top, top right, left, right, bottom left, bottom, and bottom right.
  • the output portion of the training data includes eight values.
  • a value of "0" means that a training value does not exist.
  • a value of "1" means that a training value exists.
  • the value corresponding to the right of the training key is “1", indicating that a training value exists to the right of the training key.
  • for example, when training values exist at the top left and the top right of the training key, the value corresponding to the top left of the training key is "1" and the value corresponding to the top right of the training key is also "1"; in this way, the values for each of multiple directions may be "1".
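  • In this first modified example, the output portion of a training example could be encoded as an eight-element vector (top left, top, top right, left, right, bottom left, bottom, bottom right), with 1 where a training value exists and 0 elsewhere, as in the sketch below; the ordering and helper name are assumptions made for illustration.
```python
DIRECTIONS = ["top_left", "top", "top_right", "left",
              "right", "bottom_left", "bottom", "bottom_right"]

def encode_direction_labels(directions_with_value):
    """Return the eight 0/1 values of the output portion, e.g.
    encode_direction_labels({"right"}) -> [0, 0, 0, 0, 1, 0, 0, 0]."""
    return [1 if d in directions_with_value else 0 for d in DIRECTIONS]
```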
  • the learning unit 102 of the first modified example executes learning of the value extraction model M so that when the input portion of the training data in FIG. 9 is input to the value extraction model M, the value extraction model M outputs the output portion of the training data in FIG. 9.
  • the value extraction model M of the first modified example outputs a value indicating whether or not a training value exists in each of a plurality of directions.
  • the value extraction model M outputs eight values.
  • the value extraction model M outputs eight values in a vector format, an array format, or another format.
  • the extraction unit 105 of the first modified example extracts an estimated value by having the value extraction model M determine whether or not an estimated value exists in each of a plurality of directions based on the position of the designated key in the estimated image EI. For example, the extraction unit 105 inputs the same estimated cell information and designated key as in the embodiment to the value extraction model M.
  • the value extraction model M of the first modified example has learned whether or not a training value exists in each of a plurality of directions for the training key, and therefore estimates whether or not an estimated value exists in each of a plurality of directions based on the features of the estimated cell information.
  • the value extraction model M outputs a value indicating whether or not an estimated value exists in each of multiple directions.
  • the value extraction model M outputs a value indicating whether or not an estimated value exists in each of eight directions based on the estimated key.
  • the extraction unit 105 obtains the eight values output from the value extraction model M. Of the eight values, the extraction unit 105 obtains, as the estimated value, the character string included in the estimated cell in a direction whose value indicates that an estimated value exists.
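  • A sketch of that decoding step: given the eight values output by the value extraction model M and the cell containing the specified key, the character string is taken from the neighboring cell in each direction whose value indicates that an estimated value exists. The direction-to-offset mapping, score threshold, and function name are assumptions for illustration only.
```python
# Row/column offset of the neighboring cell for each of the eight directions.
DIRECTION_OFFSETS = {
    "top_left": (-1, -1), "top": (-1, 0), "top_right": (-1, 1),
    "left": (0, -1), "right": (0, 1),
    "bottom_left": (1, -1), "bottom": (1, 0), "bottom_right": (1, 1),
}

def decode_estimated_values(cells, key_cell, direction_scores, threshold=0.5):
    """Collect the text of each cell lying in a direction whose score indicates a value exists."""
    values = []
    for direction, score in zip(DIRECTION_OFFSETS, direction_scores):
        if score < threshold:
            continue
        dr, dc = DIRECTION_OFFSETS[direction]
        for c in cells:
            if c.row == key_cell.row + dr and c.col == key_cell.col + dc:
                values.append(c.text)
    return values
```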
  • the value extraction system 1 of the first modified example extracts an estimated value by having the value extraction model M determine whether or not an estimated value exists in each of a plurality of directions based on the position of the specified key in the estimated image EI. This allows the value extraction system 1 to extract an estimated value even if there are a plurality of directions in which an estimated value exists for the estimated key.
  • the value extraction model M may estimate that an estimated value exists in each of a plurality of directions with respect to an estimated key. In this case, if only one estimated value exists for one estimated key, only the estimation result in one direction is correct, and the estimation results in the remaining directions are incorrect. If the direction in which an estimated value is likely to exist for the estimated key is specified in advance, a priority order may be set for each of the plurality of directions.
  • the data storage unit 100 of variant 2 stores data indicating the relationship between each of the multiple directions and the priority order.
  • a priority order is set for each of the eight directions.
  • the right has the highest priority order
  • the top right and bottom right have the same priority order and are the second highest.
  • the other directions have the lowest priority order.
  • the extraction unit 105 extracts an estimated value based on the priority order of each of the multiple directions.
  • the extraction unit 105 extracts, as the estimated value, a character string in the direction with the highest priority among the multiple directions in which it has been determined that an estimated value exists.
  • the extraction unit 105 may select a predetermined number of directions in descending order of priority among the multiple directions in which it has been determined that an estimated value exists, and extract a character string in the selected directions as the estimated value.
  • the extraction unit 105 may select all directions with a priority equal to or higher than a threshold among the multiple directions in which it has been determined that an estimated value exists, and extract a character string in the selected directions as the estimated value.
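  • The priority-based selection of the second modified example might look like the following sketch, which keeps, among the directions judged to contain an estimated value, the one with the highest priority; the priority table mirrors the example above (right first, then top right and bottom right) and is an assumption rather than a value fixed by this disclosure.
```python
# Smaller number = higher priority; directions not listed share the lowest priority.
DIRECTION_PRIORITY = {"right": 0, "top_right": 1, "bottom_right": 1}
LOWEST_PRIORITY = 2

def pick_by_priority(directions_with_value):
    """Return the direction with the highest priority among those judged to hold a value."""
    if not directions_with_value:
        return None
    return min(directions_with_value,
               key=lambda d: DIRECTION_PRIORITY.get(d, LOWEST_PRIORITY))
```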
  • the value extraction system 1 of the second modified example extracts the estimated value based on the priority order of each of the multiple directions. This allows the value extraction system 1 to improve the accuracy of extracting the estimated value.
  • the estimated document ED and the training document are both receipts and are the same type.
  • the estimated document ED and the training document may be different types.
  • the estimated document ED may be a receipt and the training document may be an estimate.
  • the type of the estimated document ED and the type of the training document may be different.
  • the value extraction model M of variant 3 has learned the relative positional relationship between the training key and training value contained in the first type of training document.
  • the first type of training document is assumed to be an estimate.
  • the training data shows the relative positional relationship between the training key and training value contained in a training image showing an estimate, which is a training document.
  • the extraction unit 105 extracts the estimated value from an estimated image EI showing a second type of estimated document ED that is different from the first type.
  • the second type of estimated document ED is assumed to be a receipt. This variant differs from the embodiment in that the types of the estimated document ED and the training document are different from each other, but is similar to the embodiment in other respects.
  • the value extraction system 1 of the third modified example extracts an estimated value from an estimated image EI in which a second type of estimated document ED, which is different from the first type, which is the type of training document learned by the value extraction model M, is shown. This allows the value extraction system 1 to extract an estimated value from the estimated image EI even if the types of the estimated document ED and the training document are different from each other, thereby improving the versatility of the value extraction model M.
  • the value extraction model M of variant 4 has learned the relative positional relationship between each of the multiple training keys and the training value corresponding to the training key.
  • the estimated document ED includes each of the multiple estimated keys and the estimated value corresponding to the estimated key. These points are as described in the embodiment.
  • the designated key acquisition unit 104 of variant 4 acquires multiple designated keys. For example, the user designates multiple designated keys.
  • the designated key acquisition unit 104 acquires multiple designated keys from the user terminal 20.
  • the multiple designated keys may be predetermined rather than designated by the user.
  • the extraction unit 105 of the fourth modified example extracts, from the estimated image EI, an estimated value corresponding to each of the multiple estimated keys, based on each of the multiple designated keys. For example, the extraction unit 105 executes the extraction of the estimated value described in the embodiment for each designated key. The extraction unit 105 inputs each of the multiple designated keys into the trained value extraction model M in succession. The extraction unit 105 obtains the estimated values output in succession from the value extraction model M.
  • the extraction unit 105 may input the multiple designated keys to the value extraction model M all at once, rather than inputting each of the multiple designated keys separately to the value extraction model M.
  • the value extraction model M learns the relative positional relationship between the multiple training keys and the training values corresponding to each of the multiple training keys.
  • the value extraction model M converts the multiple designated keys into features together, and outputs an estimation result according to the features.
  • the value extraction system 1 of variant example 4 extracts an estimated value corresponding to each of a plurality of estimated keys from the estimated image EI based on each of the plurality of designated keys. This makes it possible to extract an estimated value corresponding to each of a plurality of designated keys from one estimated image EI, eliminating the need to prepare a separate value extraction model M for each designated key. Even if one estimated image EI contains estimated values corresponding to each of a plurality of designated keys, one value extraction model M can be used.
  • the value extraction system 1 of the fifth modified example includes an additional generation unit 106 and an additional learning unit 107.
  • the additional generation unit 106 generates additional training data for the value extraction model M to learn based on the estimated image EI, the specified key, and the estimated value. For example, the additional generation unit 106 generates an input portion of the additional training data based on estimated key cell information of an estimated key cell extracted from the estimated image EI. The additional generation unit 106 generates an output portion of the additional training data based on estimated value cell information of an estimated value cell including an estimated value. The additional generation unit 106 generates additional training data including the generated input portion and output portion.
  • the additional learning unit 107 performs additional learning of the value extraction model M based on the additional training data.
  • the additional learning is learning based on the additional training data.
  • the additional learning itself may be similar to normal learning.
  • the additional learning unit 107 performs additional learning of the value extraction model M such that when an input portion of the additional training data is input to the value extraction model M, an output portion of the additional training data is output.
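  • As a sketch of this fifth modified example, an additional training example could be assembled from the estimated image's key cell and the extracted value cell and used to update the already-trained model with the same learning routine as before. The reuse of generate_training_data and train_value_extraction_model from the earlier sketches, and the encode_example callable, are hypothetical assumptions for illustration.
```python
def additional_learning(model, estimated_cells, designated_key, extracted_value,
                        encode_example):
    """Turn one estimation result into additional training data and fine-tune the model.
    `encode_example` is a hypothetical callable that converts the example dict into
    the (inputs, target) tensors expected by the training routine."""
    extra_example = generate_training_data(estimated_cells, designated_key, extracted_value)
    if extra_example is None:
        return model
    # Additional learning itself proceeds in the same way as normal learning,
    # but only on the newly generated example.
    return train_value_extraction_model(model, [encode_example(extra_example)], epochs=1)
```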
  • the value extraction system 1 of variant example 5 generates additional training data for learning the value extraction model M based on the estimated image EI, the specified key, and the estimated value.
  • the value extraction system 1 performs additional learning of the value extraction model M based on the additional training data. This enables the value extraction system 1 to improve the accuracy of the value extraction model M.
  • the relative positional relationship between the estimated key and the estimated value may differ depending on the language of the estimated document ED. For example, if the estimated document ED is in Arabic, the estimated value may be located to the left of the estimated key. For this reason, the value extraction model M may learn the relative positional relationship between the training key and the training value contained in each training document of multiple languages.
  • the input portion of the training data of variant example 6 includes training language information related to the language of the training document. Other portions of the training data may be similar to those of the embodiment.
  • the training language information is an ID indicating a language such as English, Japanese, Korean, Chinese, or Arabic.
  • the training data generation unit 101 automatically identifies the language of the training document based on character recognition of the training document.
  • the training data generation unit 101 may identify the language of the training document based on a method other than character recognition (for example, a method using N-grams or a machine learning method).
  • the creator of the value extraction model M may specify the language of the training document.
  • the training data generation unit 101 generates training language information based on the result of identifying the language of the training document, and includes it in the input portion of the training data.
  • the method of generating the other portions of the training data may be the same as in the embodiment.
  • the learning performed by the learning unit 102 may also be the same as in the embodiment. Since the input portion of the training data includes the training language information, the value extraction model M learns the tendency of the relative positional relationship between the training key and the training value for each language.
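The sketch below illustrates one possible way to attach training language information to the input portion, assuming a crude script-based guess when character recognition output is not used; the Unicode ranges, the language IDs, and the dictionary layout are assumptions for illustration. At inference time the language identification unit 108 would supply the corresponding estimated language information.

```python
# Sketch of variant example 6 (training side): add a language ID to the input
# portion of the training data.

LANGUAGE_IDS = {"en": 0, "ja": 1, "ko": 2, "zh": 3, "ar": 4}

def guess_language(text):
    # Very rough script-based guess; character recognition results or a
    # dedicated language identifier could be used instead.
    has_cjk = False
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF:      # hiragana / katakana
            return "ja"
        if 0xAC00 <= cp <= 0xD7AF:      # hangul syllables
            return "ko"
        if 0x0600 <= cp <= 0x06FF:      # arabic
            return "ar"
        if 0x4E00 <= cp <= 0x9FFF:      # CJK ideographs (no kana seen yet)
            has_cjk = True
    return "zh" if has_cjk else "en"

def build_input_portion(training_key_cell, training_document_text):
    language = guess_language(training_document_text)
    return {
        "key_cell": training_key_cell,
        "training_language_information": LANGUAGE_IDS[language],
    }

print(build_input_portion({"text": "合計", "row": 2, "col": 0}, "合計金額は 1,200 円です"))
# training_language_information: 1 (Japanese)
```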
  • the value extraction system 1 of the sixth modified example includes a language identification unit 108.
  • the language identification unit 108 identifies the language of the estimated document ED. For example, the language identification unit 108 identifies the language of the estimated document ED based on character recognition or other techniques, similar to the training data generation unit 101.
  • the user may specify the language of the estimated document ED.
  • the language identification unit 108 may identify the language of the estimated document ED by acquiring estimated language information related to the language specified by the user from the user terminal 20.
  • the extraction unit 105 of variant example 6 extracts an estimated value further based on the language of the estimated document ED.
  • the extraction unit 105 inputs the estimated language information, estimated cell information, and designated key to the value extraction model M.
  • the value extraction model M calculates the features of the estimated language information, estimated cell information, and designated key, and outputs an estimation result according to the features.
  • the extraction unit 105 extracts an estimated value from the estimated image EI based on the estimation result output from the value extraction model M.
  • the value extraction system 1 of the sixth modified example extracts an estimated value further based on the language of the estimated document ED. This enables estimation according to the language of the estimated document ED, thereby improving the accuracy of extraction of the estimated value.
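The following sketch shows how the language can change the interpretation of relative positions at extraction time. The left/right rule is a hand-written placeholder for what the trained value extraction model M would infer from the estimated language information; the cell layout, the demo data, and the function name are illustrative assumptions.

```python
# Sketch of variant example 6 (extraction side): the estimated language
# information influences where the estimated value is expected to be.

def extract_estimated_value(cells, designated_key, estimated_language):
    # Placeholder rule standing in for the trained value extraction model M:
    # for Arabic documents look one column to the left of the key, otherwise
    # one column to the right.
    key_cell = next(c for c in cells if c["text"] == designated_key)
    offset = -1 if estimated_language == "ar" else 1
    for cell in cells:
        if cell["row"] == key_cell["row"] and cell["col"] == key_cell["col"] + offset:
            return cell["text"]
    return None

cells = [{"text": "150", "row": 0, "col": 0},    # value placed to the left
         {"text": "TOTAL", "row": 0, "col": 1}]  # key of an Arabic document (illustrative)
print(extract_estimated_value(cells, "TOTAL", "ar"))  # 150
```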
  • the relative positional relationship between the estimated key and the estimated value may differ depending on the characteristics of the overall layout of the estimated document ED. Taking a Japanese estimated document ED as an example, some estimated documents ED are written vertically and others are written horizontally. Likewise, some estimated documents ED are printed in a vertical (portrait) orientation and others in a horizontal (landscape) orientation. The same applies to other layout characteristics: the relative positional relationship between the estimated key and the estimated value may differ depending on the overall layout of the estimated document ED.
  • the value extraction model M may have learned features related to the overall layout of the training document.
  • the input portion of the training data in variant example 7 includes training feature information related to the overall layout features of the training document.
  • Other parts of the training data may be similar to those in the embodiment.
  • the training feature information indicates, for example, the orientation of the character strings (vertical or horizontal writing), the printing orientation (vertical or horizontal), the font size, the margin size, the line spacing, the number of characters per line, or the number of lines per page.
  • the training data generation unit 101 automatically identifies the overall layout characteristics of the training document based on character recognition of the training document.
  • the training data generation unit 101 may also identify the overall layout characteristics of the training document based on a method other than character recognition (for example, a region extraction method, a line segment detection method, a template matching method, or a machine learning method).
  • the creator of the value extraction model M may specify the overall layout characteristics of the training document.
  • the training data generation unit 101 generates training feature information based on the results of identifying the overall layout features of the training document, and includes it in the input portion of the training data.
  • the method of generating other parts of the training data may be similar to that of the embodiment.
  • the learning performed by the learning unit 102 may also be similar to that of the embodiment. Since the input portion of the training data includes the training feature information, the value extraction model M learns the tendency of the relative positional relationship between the training key and the training value for each overall layout feature.
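As one possible way to derive training feature information from recognized cell bounding boxes, the sketch below computes a few overall layout features. The box format, the thresholds, and the rough line clustering are assumptions for illustration; region extraction, line segment detection, template matching, or machine learning could be used instead.

```python
# Sketch of variant example 7 (training side): derive overall layout features
# from the bounding boxes obtained by character recognition.

def layout_features(cells):
    """cells: list of dicts with 'text', 'x', 'y', 'w', 'h' in pixels."""
    # Cells that are much taller than wide suggest vertical writing.
    vertical_like = sum(1 for c in cells if c["h"] > 2 * c["w"])
    writing = "vertical" if vertical_like > len(cells) / 2 else "horizontal"
    page_w = max(c["x"] + c["w"] for c in cells)
    page_h = max(c["y"] + c["h"] for c in cells)
    orientation = "portrait" if page_h >= page_w else "landscape"
    lines = len({round(c["y"] / 20) for c in cells})   # rough line clustering
    chars_per_line = sum(len(c["text"]) for c in cells) / max(lines, 1)
    return {
        "writing_direction": writing,
        "print_orientation": orientation,
        "lines_per_page": lines,
        "characters_per_line": round(chars_per_line, 1),
    }

print(layout_features([
    {"text": "合計", "x": 40, "y": 100, "w": 60, "h": 20},
    {"text": "1,200", "x": 120, "y": 100, "w": 80, "h": 20},
    {"text": "発行日", "x": 40, "y": 140, "w": 90, "h": 20},
]))
```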
  • the value extraction system 1 of the seventh modified example includes a feature identification unit 109.
  • the feature identification unit 109 identifies features related to the overall layout of the estimated image EI. For example, similar to the training data generation unit 101, the feature identification unit 109 identifies the overall layout features of the estimated document ED based on character recognition or other techniques. The user may specify the overall layout features of the estimated document ED. The feature identification unit 109 may identify the overall layout features of the estimated document ED by acquiring estimated feature information related to the overall layout features specified by the user from the user terminal 20.
  • the extraction unit 105 of variant 7 extracts an estimated value further based on features related to the overall layout of the estimated image EI. For example, the extraction unit 105 inputs the estimated feature information, estimated cell information, and designated key to the value extraction model M.
  • the value extraction model M calculates the feature quantities of the estimated feature information, estimated cell information, and designated key, and outputs an estimation result according to the feature quantities.
  • the extraction unit 105 extracts an estimated value from the estimated image EI based on the estimation result output from the value extraction model M.
  • the value extraction system 1 of variant example 7 extracts the estimated value further based on features related to the overall layout of the estimated image EI. This enables estimation according to the overall layout of the estimated document ED, improving the accuracy of extracting the estimated value.
  • in the embodiment described above, the training key and the training value have a one-to-one relationship, but the training key and the training value may also have a one-to-many relationship.
  • the value extraction model M of variant 8 learns the relative positional relationship between one training key and each of the multiple training values corresponding to that one training key.
  • the output portion of the training data of variant 8 includes training value cell information for each of the multiple training values. This differs from the embodiment in that there is training value cell information for each of the multiple training values for one training key, but is similar to the embodiment in other respects.
  • the extraction unit 105 of variant 8 extracts multiple estimated values corresponding to one specified key from the estimated image EI.
  • the input data that the extraction unit 105 inputs to the trained value extraction model M is the same as in the embodiment.
  • the value extraction model M calculates the feature amount of the input data and outputs an estimation result according to the feature amount.
  • the value extraction model M may output only one estimated value as the estimation result, or may output multiple estimated values.
  • the extraction unit 105 extracts the multiple estimated values output from the value extraction model M from the estimated image EI.
  • the value extraction system 1 of variant example 8 extracts multiple estimated values corresponding to one specified key from the estimated image EI. This allows the value extraction system 1 to extract multiple estimated values from the estimated document ED in which the estimated keys and estimated values correspond one-to-many.
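The sketch below illustrates handling an estimation result that contains several estimated values for one designated key. The same-row, right-of-key rule is a hand-written placeholder for the trained value extraction model M; the cell layout and the names are illustrative assumptions.

```python
# Sketch of variant example 8: one designated key may correspond to several
# estimated values, so the estimation result is handled as a list.

def value_extraction_model(cells, designated_key):
    # Placeholder for the trained model M: pretend every cell to the right of
    # the key in the same row was estimated to be a value.
    key_cell = next(c for c in cells if c["text"] == designated_key)
    hits = [c for c in cells if c["row"] == key_cell["row"] and c["col"] > key_cell["col"]]
    return sorted(hits, key=lambda c: c["col"])

def extract_estimated_values(cells, designated_key):
    # Zero, one, or many estimated values may come back for one key.
    return [c["text"] for c in value_extraction_model(cells, designated_key)]

cells = [
    {"text": "Phone", "row": 0, "col": 0},
    {"text": "03-1234-5678", "row": 0, "col": 1},
    {"text": "090-0000-0000", "row": 0, "col": 2},
]
print(extract_estimated_values(cells, "Phone"))
# ['03-1234-5678', '090-0000-0000']
```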
  • the value extraction model M may learn the relative positional relationship between one training key and each of multiple training values arranged in a predetermined direction.
  • the predetermined direction is either up, down, left, or right.
  • each of the multiple training values is arranged below the training key.
  • the learning method of the value extraction model M when multiple training values exist for one training key may be the same as variant 8.
  • the extraction unit 105 of variant 9 extracts from the estimated image EI each of the multiple estimated values that correspond to one specified key and are aligned in the predetermined direction.
  • the processing by the extraction unit 105 may be similar to that of variant 8. Since the relative positional relationship between the training key and the multiple training values aligned in a predetermined direction is learned by the value extraction model M, the value extraction model M outputs each of the multiple estimated values aligned in the predetermined direction as an estimation result.
  • the extraction unit 105 extracts each of the multiple estimated values aligned in the predetermined direction output from the value extraction model M from the estimated image EI.
  • the value extraction system 1 of the ninth modified example extracts from the estimated image EI each of a plurality of estimated values that correspond to one specified key and are aligned in the predetermined direction. This allows the value extraction system 1 to extract a plurality of estimated values from an estimated document ED in which the estimated key and the estimated values correspond one-to-many and the plurality of estimated values are lined up in the predetermined direction.
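A minimal sketch of variant example 9, assuming the predetermined direction is "down": the estimated values are the cells lined up below the key cell, kept in top-to-bottom order. The coordinate layout, the x tolerance, and the demo data are assumptions for illustration; the trained value extraction model M would output these cells as its estimation result.

```python
# Sketch of variant example 9: multiple estimated values aligned below the key.

def values_below_key(cells, designated_key, x_tolerance=10):
    key_cell = next(c for c in cells if c["text"] == designated_key)
    below = [c for c in cells
             if abs(c["x"] - key_cell["x"]) <= x_tolerance and c["y"] > key_cell["y"]]
    # Keep the values in top-to-bottom order.
    return [c["text"] for c in sorted(below, key=lambda c: c["y"])]

cells = [
    {"text": "Participants", "x": 50, "y": 100},
    {"text": "Sato", "x": 50, "y": 130},
    {"text": "Suzuki", "x": 52, "y": 160},
    {"text": "Tanaka", "x": 49, "y": 190},
]
print(values_below_key(cells, "Participants"))  # ['Sato', 'Suzuki', 'Tanaka']
```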
  • the extraction unit 105 may determine a search range in a predetermined direction for the value extraction model M to search based on the estimated image EI, and extract each of the multiple estimated values based on the search range.
  • the search range is a range in the estimated image EI from which the estimated values are extracted.
  • the extraction unit 105 extracts the estimated values from the search range of the estimated image EI.
  • the extraction unit 105 inputs estimated cell information of estimated cells within the search range to the value extraction model M.
  • the extraction unit 105 does not input estimated cell information of estimated cells outside the search range to the value extraction model M.
  • the extraction unit 105 may determine the search range based on a predetermined determination method. For example, the extraction unit 105 may determine the search range based on the size of the estimated document ED shown in the estimated image EI. In this case, the extraction unit 105 determines the search range such that the larger the size of the estimated document ED, the wider the search range. The extraction unit 105 may also determine the search range based on the specified key. In this case, it is assumed that the relationship between specified keys and search ranges is stored in advance in the data storage unit 100. The extraction unit 105 determines the search range by acquiring the search range associated with the specified key.
  • if the estimated document ED includes a predetermined image, the extraction unit 105 may determine the search range based on that image.
  • the extraction unit 105 may identify that image from the estimated image EI based on a technique such as pattern matching, and determine an area at a predetermined position relative to that image as the search range.
  • the extraction unit 105 may determine, as the search range, an area within a specific distance from a position in the estimated image EI where the same character string as the designated key is placed. This differs from the embodiment in that the value extraction model M targets the search range for estimation, but the estimation by the value extraction model M itself is as described in the embodiment.
  • the value extraction system 1 of the modified example 10 determines a search range in a predetermined direction for the value extraction model M to search based on the estimated image EI, and extracts each of the multiple estimated values based on the search range. This improves the accuracy of extracting the estimated values when multiple estimated values are lined up in a predetermined direction.
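The sketch below illustrates one way the search range of variant example 10 could be determined and applied, assuming the predetermined direction is "down": a default range scaled by the document size, overridden by a range stored per designated key, with only the cells inside the range handed to the value extraction model M. The sizing rule, the per-key table, and the names are assumptions for illustration.

```python
# Sketch of variant example 10: restrict the estimated cells given to the
# value extraction model M to a search range below the designated key.

SEARCH_RANGE_BY_KEY = {"Participants": 200, "Items": 400}  # pixels below the key

def cells_in_search_range(cells, designated_key, document_height):
    key_cell = next(c for c in cells if c["text"] == designated_key)
    # Larger documents get a wider default range; a key-specific range wins.
    default_depth = document_height * 0.25
    depth = SEARCH_RANGE_BY_KEY.get(designated_key, default_depth)
    return [c for c in cells if key_cell["y"] < c["y"] <= key_cell["y"] + depth]

cells = [
    {"text": "Participants", "x": 50, "y": 100},
    {"text": "Sato", "x": 50, "y": 130},
    {"text": "Suzuki", "x": 50, "y": 160},
    {"text": "Remarks", "x": 50, "y": 900},   # outside the range, not passed to M
]
print([c["text"] for c in cells_in_search_range(cells, "Participants", 1000)])
# ['Sato', 'Suzuki']
```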
  • the value extraction system 1 may learn to extract an estimated value based not only on the positional relationship but also on the character type (for example, letters, numbers, or a mixture of letters and numbers).
  • the character type of the training value may be included in the output portion of the training data.
  • the training data generation unit 101 automatically identifies the character type of the training value based on character recognition of the training document.
  • the training data generation unit 101 may also identify the character type of the training value based on a method other than character recognition (for example, a method using N-gram or a machine learning method).
  • the creator of the value extraction model M may specify the character type of the training value.
  • the training data generation unit 101 generates character type information based on the result of identifying the character type of the training value, and includes it in the output portion of the training data.
  • the method of generating other parts of the training data may be the same as in the embodiment.
  • the learning performed by the learning unit 102 may also be the same as in the embodiment. Since the output portion of the training data includes information on the character type of the training value, the value extraction model M also learns the tendency of the character type of the training value corresponding to the training key.
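A small sketch of how the character type of a training value could be identified for inclusion in the output portion of the training data is shown below; the three categories and the separator-stripping rule are assumptions for illustration.

```python
# Sketch of the character-type variant: classify a value as letters, numbers,
# or mixed so the type can be added to the output portion of the training data.

import re

def character_type(value):
    compact = re.sub(r"[\s,.\-]", "", value)   # ignore common separators
    if compact.isdigit():
        return "numbers"
    if compact.isalpha():
        return "letters"
    return "mixed"

for v in ["1,200", "Tokyo", "INV-001"]:
    print(v, "->", character_type(v))
# 1,200 -> numbers, Tokyo -> letters, INV-001 -> mixed
```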
  • in the embodiment, the relative positional relationship between the row and column of the training key and the row and column of the training value is learned by the value extraction model M, but the relative positional relationship between the coordinates indicating the position of the training key and the coordinates indicating the position of the training value may also be learned by the value extraction model M.
  • the estimated cell information includes the coordinates of the estimated cell EC.
  • the value extraction model M converts the coordinates of the estimated cell EC and the designated key into feature quantities, and outputs an estimated value according to the feature quantities.
  • the value extraction model M may learn the position of the character string without the estimated cell EC being explicitly extracted.
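As a sketch of the coordinate-based variant, the snippet below expresses a cell position as normalized center coordinates instead of a row and column, so that relative positional relationships can be learned directly from coordinates; the normalization and the feature layout are assumptions for illustration.

```python
# Sketch of the coordinate-based variant: positions as normalized coordinates.

def coordinate_features(cell, image_width, image_height):
    # Center point of the cell, normalized to [0, 1] so that documents of
    # different sizes become comparable.
    cx = (cell["x"] + cell["w"] / 2) / image_width
    cy = (cell["y"] + cell["h"] / 2) / image_height
    return (round(cx, 3), round(cy, 3))

key_cell = {"x": 40, "y": 100, "w": 60, "h": 20}
value_cell = {"x": 120, "y": 100, "w": 80, "h": 20}
print(coordinate_features(key_cell, 800, 1200), coordinate_features(value_cell, 800, 1200))
```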
  • the main processing is performed by the server 10, but the processing described as being performed by the server 10 may be performed by the user terminal 20 or another computer, or may be shared among multiple computers.
  • the value extraction system can be configured as follows.
(1) A value extraction system including: an estimated image acquisition unit that acquires an estimated image showing an estimated document including an estimated key and an estimated value; a designated key acquisition unit that acquires a designated key designated by a user; and an extraction unit that extracts the estimated value from the estimated image based on the estimated image, the designated key, and a value extraction model in which a relative positional relationship between a training key and a training value included in a training document shown in a training image is learned.
(2) The value extraction system according to (1), wherein the value extraction model learns a relative positional relationship between the rows and columns of the training keys in the training image and the rows and columns of the training values in the training image, and the extraction unit obtains a row and a column of the designated key based on a result of performing character recognition on the estimated image, and extracts the estimated value based on the row and column of the designated key and the value extraction model.
(3) The value extraction system, wherein the value extraction model learns a relative positional relationship between training key cell information regarding the position of a training key cell including the training key and training value cell information regarding the position of a training value cell including the training value, and the extraction unit obtains estimated key cell information regarding the position of an estimated key cell including the estimated key based on character recognition of the estimated image, and extracts the estimated value based on the estimated key cell information and the value extraction model.
(4) The value extraction system according to any one of (1) to (3), wherein the value extraction model learns whether the training value exists in each of a plurality of directions based on the position of the training key in the training image, and the extraction unit extracts the estimated value by causing the value extraction model to determine whether or not the estimated value exists in each of the plurality of directions based on the position of the designated key in the estimated image.
(5) The value extraction system according to (4), wherein a priority order is set for each of the plurality of directions, and when there are a plurality of directions in which it is determined that the estimated value exists, the extraction unit extracts the estimated value based on the priority order of each of the plurality of directions.
(6) The value extraction system according to any one of (1) to (5), wherein the value extraction model learns a relative positional relationship between the training key and the training value included in the training document of a first type, and the extraction unit extracts the estimated value from the estimated image in which the estimated document of a second type different from the first type is shown.
(7) The value extraction system according to any one of (1) to (6), wherein the value extraction model learns a relative positional relationship between each of a plurality of the training keys and the training value corresponding to that training key, the estimated document includes each of a plurality of the estimated keys and the estimated value corresponding to that estimated key, the designated key acquisition unit acquires a plurality of the designated keys, and the extraction unit extracts, from the estimated image, the estimated value corresponding to each of the estimated keys based on each of the designated keys.
(8) The value extraction system according to any one of (1) to (7), further comprising: a training data generation unit that generates training data including training key cell information on the position of a training key cell including the training key and training value cell information on the position of a training value cell including the training value, based on character recognition of the training image; and a learning unit that executes learning of the value extraction model based on the training data, wherein the extraction unit extracts the estimated value based on the value extraction model learned by the learning unit.
(9) The value extraction system according to any one of (1) to (8), further comprising: an additional generation unit that generates additional training data for learning of the value extraction model based on the estimated image, the designated key, and the estimated value; and an additional learning unit that performs additional learning of the value extraction model based on the additional training data.
(10) The value extraction system according to any one of (1) to (9), wherein the value extraction model learns a relative positional relationship between the training key and the training value included in the training document in each of a plurality of languages, the value extraction system further includes a language identification unit that identifies a language of the estimated document, and the extraction unit extracts the estimated value further based on the language of the estimated document.
(11) The value extraction system, wherein the value extraction model learns features related to the overall layout of the training document, the value extraction system further includes a feature identification unit that identifies features related to the overall layout of the estimated image, and the extraction unit extracts the estimated value further based on the features related to the overall layout of the estimated image.
(12) The value extraction system according to any one of (1) to (11), wherein the value extraction model learns a relative positional relationship between one of the training keys and one of the training values corresponding to that training key, and the extraction unit extracts, from the estimated image, one of the estimated values corresponding to one of the designated keys.
(13) The value extraction system according to any one of (1) to (12), wherein the value extraction model learns a relative positional relationship between one of the training keys and each of a plurality of the training values corresponding to that training key, and the extraction unit extracts, from the estimated image, a plurality of the estimated values corresponding to one of the designated keys.
(14) The value extraction system, wherein the value extraction model learns a relative positional relationship between the one training key and each of the plurality of training values arranged in a predetermined direction relative to each other, and the extraction unit extracts, from the estimated image, each of the plurality of estimated values corresponding to the one designated key and aligned in the direction.
(15) The value extraction system, wherein the extraction unit determines a search range in which the value extraction model is to search in the direction based on the estimated image, and extracts each of the plurality of estimated values based on the search range.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)
PCT/JP2023/013598 2023-03-31 2023-03-31 バリュー抽出システム、バリュー抽出方法、及びプログラム Pending WO2024202018A1 (ja)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2025509603A JPWO2024202018A1 2023-03-31 2023-03-31
PCT/JP2023/013598 WO2024202018A1 (ja) 2023-03-31 2023-03-31 バリュー抽出システム、バリュー抽出方法、及びプログラム

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2023/013598 WO2024202018A1 (ja) 2023-03-31 2023-03-31 バリュー抽出システム、バリュー抽出方法、及びプログラム

Publications (1)

Publication Number Publication Date
WO2024202018A1 true WO2024202018A1 (ja) 2024-10-03

Family

ID=92904607

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/013598 Pending WO2024202018A1 (ja) 2023-03-31 2023-03-31 バリュー抽出システム、バリュー抽出方法、及びプログラム

Country Status (2)

Country Link
JP (1) JPWO2024202018A1
WO (1) WO2024202018A1

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018128996A (ja) * 2017-02-10 2018-08-16 キヤノン株式会社 情報処理装置、制御方法、およびプログラム
JP2021077332A (ja) * 2019-11-05 2021-05-20 キヤノン株式会社 情報処理装置、サーバ、システム、情報処理方法、およびプログラム

Also Published As

Publication number Publication date
JPWO2024202018A1 2024-10-03


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23930653

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2025509603

Country of ref document: JP

Kind code of ref document: A

WWE Wipo information: entry into national phase

Ref document number: 2025509603

Country of ref document: JP