AU2021412659A1 - Architecture for digitalizing documents using multi-model deep learning, and document image processing program - Google Patents


Info

Publication number
AU2021412659A1
Authority
AU
Australia
Prior art keywords
character string
layout
document image
unit
document
Prior art date
Legal status
Pending
Application number
AU2021412659A
Other versions
AU2021412659A9 (en)
Inventor
Hossain Shariar Sheikh
Current Assignee
Deloitte Touche Tohmatsu LLC
Original Assignee
Deloitte Touche Tohmatsu LLC
Priority date
Filing date
Publication date
Application filed by Deloitte Touche Tohmatsu LLC filed Critical Deloitte Touche Tohmatsu LLC
Publication of AU2021412659A1
Publication of AU2021412659A9

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/192 Recognition using electronic means using simultaneous comparisons or correlations of the image signals with a plurality of references
    • G06V30/194 References adjustable by an adaptive method, e.g. learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/412 Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables

Abstract

The purpose of the present invention is to convert a character string contained in a document image to text data by a method different from conventional optical character recognition. In the present invention, an electronic document generation device is provided with: a document image acquisition unit that acquires a document image obtained by imaging a document; a character string recognition unit that recognizes a character string contained in the document image acquired by the document image acquisition unit using a character string learning model that has learned correspondence between document images and character strings contained in the document images and outputs text data of the recognized character string; and an output unit that outputs the text data as text for an electronic medium.

Description

ARCHITECTURE FOR DIGITALIZING DOCUMENTS USING MULTI-MODEL DEEP LEARNING AND DOCUMENT IMAGE PROCESSING PROGRAM

BACKGROUND OF THE INVENTION
1. Field of the Invention
[0001] The present invention relates to an electronic document generation device, an electronic document generation method, and an electronic document generation program
and particularly to an electronic document generation device, an electronic document
generation method, and an electronic document generation program for scanning paper
documents and generating electronic documents.
2. Description of Related Art
[0002] Digital information technology has advanced and paperless systems have spread, but storage or delivery of information using paper documents is still widely used.
In companies having a large amount of paper documents, for example, there is demand for
technology capable of efficiently converting paper documents to digital documents.
[0003] OCR text recognition technology in the related art has a problem in that the recognition efficiency of character recognition is poor because recognition is performed character by character (for example, see Patent Document 1).
[Citation List]
[Patent Document]
[0004]
[Patent Document 1]
Japanese Unexamined Patent Application, First Publication No. 2010-244372
SUMMARY OF THE INVENTION
[Technical Problem]
[0005] Therefore, an electronic document generation device, an electronic document generation method, and an electronic document generation program according to the present disclosure are provided for converting a character string contained in a document image to text data using a method different from optical character recognition in the related art.
[Solution to Problem]
[0006] That is, an electronic document generation device according to a first aspect includes: a document image acquiring unit configured to acquire a document image obtained
by imaging a document; a character string recognizing unit configured to recognize a
character string contained in the document image acquired by the document image acquiring
unit using a character string learning model having learned correspondence between
document images and character strings contained in the document images and to generate
text data of the character string; and an output unit configured to output the text data as text
of an electronic medium.
[0007] A second aspect provides the electronic document generation device
according to the first aspect, further including a layout recognizing unit configured to identify
a range of each of a plurality of elements contained in the document image acquired by the
document image acquiring unit in the document image using a layout learning model having
learned correspondence between a plurality of elements contained in document images and
identification information of the plurality of elements, to recognize a type of each of the
plurality of elements, and to acquire position information of each of the plurality of elements
in the document image associated with the range, wherein the character string recognizing unit
recognizes a character string contained in the range recognized by the layout recognizing
unit using the character string learning model and generates text data of the character string,
and the output unit outputs the text data associated with the plurality of elements in the
position information of the range associated with the plurality of elements as text of an
electronic medium.
[0008] A third aspect provides the electronic document generation device
according to the second aspect, wherein the type of each element is one of a character string,
a table, an image, a seal, and handwriting.
[0009] A fourth aspect provides the electronic document generation device according to the third aspect, further including an extraction unit configured to extract each of cells in a table included in an element of which the type recognized by the layout recognizing unit corresponds to a table and to acquire position information of each cell in the document image, wherein the character string recognizing unit recognizes a character string included in each cell extracted by the extraction unit using the character string learning model and generates text data of the character string.
[0010] A fifth aspect provides the electronic document generation device according to any one of the second to fourth aspects, wherein annotations associated with the types corresponding to the elements are given to the elements in the document image including the plurality of elements, the electronic document generation device further includes a layout learning data generating unit configured to accumulate a plurality of the document images to which the annotations are given to generate layout learning data, and the layout learning data is used for supervised learning of the layout learning model.
[0011] A sixth aspect provides the electronic document generation device according to the fifth aspect, wherein position information of ranges associated with the plurality of elements included in the document image in the document image along with the annotations is given to the document image.
[0012] A seventh aspect provides the electronic document generation device according to the fifth or sixth aspect, further including a layout learning data correcting unit configured to correct at least one of the types of the plurality of elements recognized by the layout recognizing unit and the position information of the ranges of the plurality of elements in the document image on the basis of an input and to update the layout learning data by adding the corrected data.
[0013] An eighth aspect provides the electronic document generation device according to the seventh aspect, further including a layout learning unit configured to perform re-learning of the layout learning model using the layout learning data updated by the layout learning data correcting unit.
[0014] A ninth aspect provides the electronic document generation device according to any one of the second to eighth aspects, further including a character string learning data generating unit configured to generate character string learning data which is used for supervised learning of the character string learning model.
[0015] A tenth aspect provides the electronic document generation device
according to the ninth aspect, further including a character string learning data correcting
unit configured to correct text data generated by the character string recognizing unit on the
basis of an input and to update the character string learning data by adding the corrected text
data.
[0016] An eleventh aspect provides the electronic document generation device according to the tenth aspect, further including a character string learning unit configured to
perform re-learning of the character string learning model using the character string learning
data updated by the character string learning data correcting unit.
[0017] A twelfth aspect provides the electronic document generation device according to any one of the second to eleventh aspects, wherein the character string
recognizing unit includes a plurality of the character string learning models and uses the
character string learning models adapted to languages of the character strings included in the
plurality of elements.
[0018] A thirteenth aspect provides the electronic document generation device according to any one of the second to twelfth aspects, further including a preprocessing unit
configured to perform preprocessing on the document image acquired by the document
image acquiring unit, wherein the preprocessing unit includes a background eliminating unit,
a tilt correcting unit, and a shape adjusting unit, the background eliminating unit eliminates
a background of the document image acquired by the document image acquiring unit, the tilt
correcting unit corrects a tilt of the document image acquired by the document image
acquiring unit, and the shape adjusting unit adjusts a shape and a size of the document image
as a whole acquired by the document image acquiring unit.
[0019] A fourteenth aspect provides the electronic document generation device
according to any one of the second to thirteenth aspects, wherein the layout learning model
is one of a layout learning model for a contract, a layout learning model for a bill, a layout learning model for a memorandum, a layout learning model for a delivery note, and a layout learning model for a receipt.
[0020] An electronic document generation method according to a fifteenth aspect is performed by a computer used for an electronic document generation device and includes:
a document image acquiring step of acquiring a document image obtained by imaging a
document; a character string recognizing step of recognizing a character string contained in
the document image acquired in the document image acquiring step using a character string
learning model having learned correspondence between document images and character
strings contained in the document images and generating text data of the character string;
and an output step of outputting the text data as text of an electronic medium.
[0021] An electronic document generation program according to a sixteenth aspect causes a computer used for an electronic document generation device to perform: a document
image acquiring function of acquiring a document image obtained by imaging a document;
a character string recognizing function of recognizing a character string contained in the
document image acquired in the document image acquiring function using a character string
learning model having learned correspondence between document images and character
strings contained in the document images and generating text data of the character string;
and an output function of outputting the text data as text of an electronic medium.
[Advantageous Effects of Invention]
[0022] Since the electronic document generation device according to the present
disclosure includes the document image acquiring unit configured to acquire a document
image obtained by imaging a document, the character string recognizing unit configured to
recognize a character string contained in the document image acquired by the document
image acquiring unit using a character string learning model having learned correspondence
between document images and character strings contained in the document images and to
generate text data of the character string, and the output unit configured to output the text
data as text of an electronic medium and recognizes a character string contained in a
document image using a model subjected to machine learning, it is possible to improve
recognition efficiency of character recognition when the document image is converted to text data.
BRIEF DESCRIPTION OF THE DRAWINGS
[0023] Features, advantages, and technical and industrial significance of exemplary embodiments of the invention will be described below with reference to the accompanying drawings, in which like numerals denote like elements, and wherein:
FIG. 1 is a diagram schematically illustrating an electronic document generation system including an electronic document generation device according to an embodiment.
FIG. 2 is a block diagram illustrating a physical configuration of the electronic document generation device.
FIG. 3 is a diagram schematically illustrating processes which are performed by the electronic document generation device.
FIG. 4 is a block diagram illustrating a functional configuration of the electronic document generation device.
FIG. 5 is a diagram illustrating input data and output data of the electronic document generation device.
FIG. 6 is a diagram illustrating background removal which is performed in preprocessing.
FIG. 7 is a diagram illustrating tilt correction which is performed in preprocessing.
FIG. 8 is a diagram illustrating shape adjustment which is performed in preprocessing.
FIG. 9 is a diagram illustrating a correction process for missing resolution which is performed in a layout recognizing process.
FIG. 10 is a diagram illustrating a correction process for overlap resolution which is performed in the layout recognizing process.
FIG. 11 is a diagram illustrating recognition of a layout which is performed in the layout recognizing process.
FIG. 12 is a diagram illustrating recognition of a table which is performed in the layout recognizing process.
FIG. 13 is a diagram illustrating extraction of a cell image.
FIG. 14 is a diagram illustrating a character string in a cell image.
FIG. 15 is a diagram illustrating arrangement of text data which is performed in a
character string recognizing process.
FIG. 16 is a diagram illustrating elimination of noise which is performed in a character
string recognizing process.
FIG. 17 is a diagram illustrating an example of layout learning data with annotations.
FIG. 18 is a diagram illustrating an example of layout learning data with annotations.
FIG. 19 is a diagram illustrating an example of layout learning data with annotations.
FIG. 20 is a diagram illustrating an example of layout learning data with annotations.
FIG. 21 is a diagram illustrating an example of layout learning data with annotations.
FIG. 22 is a diagram illustrating an example of layout learning data with annotations.
FIG. 23 is a diagram illustrating an example of layout learning data with annotations.
FIG. 24 is a diagram illustrating an example of character string learning data with
annotations.
FIG. 25 is a flowchart of an electronic document generation program.
FIG. 26 is a flowchart (1/3) of an embodiment of the electronic document generation
program.
FIG. 27 is a flowchart (2/3) of the embodiment of the electronic document generation
program.
FIG. 28 is a flowchart (3/3) of the embodiment of the electronic document generation
program.
DETAILED DESCRIPTION OF EMBODIMENTS
[0024] An electronic document generation device 10 according to an embodiment
of the present disclosure will be described below with reference to FIGS. 1 to 24. In this
embodiment, it is assumed that the electronic document generation device 10 is connected
to an information communication network 11 such as the Internet or a local area network
(LAN) for use. An electronic document generation system 100 including the electronic
document generation device 10 will be schematically described below with reference to FIG.
1. FIG. 1 is a diagram schematically illustrating the electronic document generation system 100 including the electronic document generation device 10.
[0025] The electronic document generation system 100 includes an electronic document generation device 10, a user terminal 12, a character string learning model 13, a layout learning model 14, and a document image database 15. The electronic document generation device 10, the user terminal 12, the character string learning model 13, the layout learning model 14, and the document image database 15 are connected to an information communication network 11 and are able to perform information communication with each other.
[0026] The electronic document generation system 100 recognizes character strings contained in a document image and generates text data using the electronic document generation device 10. The electronic document generation device 10 recognizes a layout of a document image using a layout learning model and recognizes character strings contained in the document image using a character string learning model.
[0027] The electronic document generation device 10 is, for example, a type of computer such as a PC and is an information processing device. The electronic document generation device 10 includes an arithmetic processing device and a microcomputer included in various computers and includes instruments and devices for embodying functions according to the present disclosure using applications.
[0028] The character string learning model 13 is a learning model for recognizing an image of a character string contained in a document image and is used for character recognition in the electronic document generation device 10. The character string learning model 13 is not particularly limited in a storage place thereof as long as it can be used via the information communication network 11 by the electronic document generation device 10 and is stored, for example, in an information processing device such as a PC, a server device, or a database. For the purpose of convenient explanation of this embodiment, the character string learning model 13 is assumed to mean an information processing device in which the character string learning model 13 is stored.
[0029] The character string learning model 13 may be constituted by an existing learning model or may be independently constructed as a learning model appropriate for use in the electronic document generation device 10. The character string learning model 13 includes learning models appropriate for various languages such as Japanese, English, and Chinese and is illustrated as a first character string learning model, a second character string learning model, and a third character string learning model in FIG. 1.
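The per-language model selection described above can be sketched as a small registry. This is a minimal illustration, not the patented implementation: the language codes, model names, and the fallback to the Japanese model are all assumptions made for the example.

```python
# Hypothetical registry mapping a language code to its character string
# learning model, mirroring the first/second/third models of FIG. 1.
STRING_MODELS = {
    "ja": "first_character_string_model",   # Japanese
    "en": "second_character_string_model",  # English
    "zh": "third_character_string_model",   # Chinese
}

def select_string_model(language: str, registry=STRING_MODELS) -> str:
    """Pick the model adapted to the element's language; falling back
    to the Japanese model for unlisted languages is an assumption
    made for this sketch."""
    return registry.get(language, registry["ja"])
```

In a real deployment the registry values would be loaded model objects (local or reached over the information communication network 11) rather than strings.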
[0030] The character string learning model 13 is not limited to connection to the information communication network 11, and may be included in the electronic document generation device 10 and used under the direct control of the electronic document generation device 10. The character string learning model 13 may be distributed and stored in a plurality of information processing devices connected to the information communication network 11.
[0031] The layout learning model 14 is a learning model that learns correspondence between a plurality of elements contained in document images and identification information of the plurality of elements on the basis of layout learning data which will be described later and is used for layout recognition in the electronic document generation device 10. Similarly to the character string learning model 13, the layout learning model 14 is not particularly limited in a storage place thereof as long as it can be used via the information communication network 11 by the electronic document generation device 10 and is stored, for example, in an information processing device connected to the information communication network 11. For the purpose of convenient explanation of this embodiment, the layout learning model 14 is assumed to mean an information processing device in which the layout learning model 14 is stored.
[0032] The layout learning model 14 includes a layout learning model for a contract, a layout learning model for a bill, a layout learning model for a memorandum, a layout learning model for a delivery note, and a layout learning model for a receipt.
[0033] The layout learning model for a contract is a learning model for recognizing a layout of a document image of a contract and performs learning using the layout learning data for a contract. The layout learning model for a contract learns at what position in the contract what information is located and particularly learns layouts specific to a contract such as itemization, no tables, and a handwriting signature box. The layout learning data for a contract is generated on the basis of document images of at least 3 or 4 contract sheets for each form of 200 types of contract forms with annotations which will be described later.
[0034] The layout learning model for a bill is a learning model for recognizing a
layout of a document image of a bill and performs learning using the layout learning data for
a bill. The layout learning model for a bill learns at what position in the bill what
information is located and particularly learns layouts specific to a bill, such as a broad range
of tables and many wordings such as alphanumeric characters mixed into Japanese text. The layout
learning data for a bill is generated on the basis of document images of at least 3 or 4 bill
sheets for each form of 200 types of bill forms with annotations which will be described later.
[0035] The layout learning model for a memorandum is a learning model for
recognizing a layout of a document image of a memorandum and performs learning using
the layout learning data for a memorandum. The layout learning model for a memorandum
learns at what position in the memorandum what information is located and particularly
learns layouts specific to a memorandum such as no tables and a handwriting signature box.
The layout learning data for a memorandum is generated on the basis of document images
of at least 3 or 4 memorandum sheets for each form of 200 types of memorandum forms with
annotations which will be described later.
[0036] The layout learning model for a delivery note is a learning model for
recognizing a layout of a document image of a delivery note and performs learning using the
layout learning data for a delivery note. The layout learning model for a delivery note
learns at what position in the delivery note what information is located and particularly learns
layouts specific to a delivery note such as a broad range of tables and writing of many product
names and product numbers. The layout learning data for a delivery note is generated on
the basis of document images of at least 3 or 4 delivery note sheets for each form of 200
types of delivery note forms with annotations which will be described later.
[0037] The layout learning model for a receipt is a learning model for recognizing
a layout of a document image of a receipt and performs learning using the layout learning
data for a receipt. The layout learning model for a receipt learns at what position in the receipt what information is located and learns layouts specific to a receipt such as a handwriting box for an amount of money and many tables in which amounts of money are written. The layout learning data for a receipt is generated on the basis of document images of at least 3 or 4 receipt sheets for each form of 200 types of receipt forms with annotations which will be described later.
[0038] The layout learning model 14 is not limited to use in the electronic document generation device 10 via the information communication network 11, and may be included
in the electronic document generation device 10. The layout learning model 14 may be
distributed and stored in a plurality of information processing devices connected to the
information communication network 11.
[0039] The document image database 15 is a database in which document images are accumulated. The electronic document generation device 10 acquires document images
stored in the document image database 15 and generates character string learning data used
for learning of a character string learning model and layout learning data used for learning
of a layout learning model.
[0040] The user terminal 12 is used to operate the electronic document generation
device 10. When erroneously recognized characters are present in an electronic document
generated by the electronic document generation device 10 or when a layout of an electronic
document generated by the electronic document generation device 10 is erroneously
recognized, the electronic document is corrected on the basis of a correction input from a
user of the user terminal 12, and the electronic document generation device 10 receives the
correction and re-trains at least one of the character string learning model 13 and the layout
learning model 14.
[0041] A physical configuration of the electronic document generation device
10 will be described below with reference to FIG. 2. FIG. 2 is a block diagram illustrating
the physical configuration of the electronic document generation device 10. The
electronic document generation device 10 includes an input/output interface 20, a
communication interface 21, a read only memory (ROM) 22, a random access memory
(RAM) 23, a storage unit 24, a central processing unit (CPU) 25, and a graphics processing unit (GPU) 28.
[0042] The input/output interface 20 performs transmission and reception of data to and from an external device outside of the electronic document generation device 10.
The external device includes an input device 26 and an output device 27 that input and output
data, for example, to and from the electronic document generation device 10. The input
device 26 includes a keyboard, a mouse, and a scanner, and the output device 27 includes a
monitor, a printer, and a speaker.
[0043] The communication interface 21 has a function of inputting and outputting data, for example, to and from the electronic document generation device 10 in
communication with the outside via the information communication network 11.
The storage unit 24 can be used as a storage device, and various applications required
for operation of the electronic document generation device 10 and various types of data used
by the applications are recorded thereon. The GPU 28 is used along with the CPU 25 and
is particularly suitable for the repeated arithmetic operations that are frequently performed
in machine learning.
[0044] The electronic document generation device 10 stores an electronic
document generation program which will be described later in the ROM 22 or the storage
unit 24 and loads the electronic document generation program to a main memory including
the RAM 23. The CPU 25 accesses the main memory to which the electronic document
generation program has been loaded and executes the electronic document generation
program.
[0045] Processes which are performed by the electronic document generation
device 10 will be schematically described below with reference to FIG. 3. FIG. 3 is a
diagram schematically illustrating processes which are performed by the electronic
document generation device 10.
The electronic document generation device 10 performs Processes I to III which will
be described below in that order.
[0046] In Process I, preprocessing 55 including "background elimination," "tilt
correction," and "shape adjustment" of a document image is performed.
The preprocessing 55 is preprocessing performed on an image including a character string to facilitate character recognition using a learning model, and it improves the recognition accuracy of the recognition processes performed in Processes II and III.
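The three steps of Process I can be sketched as a small pipeline. This is a minimal NumPy illustration of the idea only: the white threshold and target canvas size are hypothetical, and the tilt correction is a placeholder (a real deskew would estimate the skew angle, e.g. via a Hough transform, and rotate by it).

```python
import numpy as np

def eliminate_background(img, white_thresh=200):
    """Push near-white pixels to pure white so only ink remains
    (the threshold value is a hypothetical choice)."""
    out = img.copy()
    out[out >= white_thresh] = 255
    return out

def correct_tilt(img, angle_deg):
    """Placeholder deskew: only exact multiples of 90 degrees are
    handled here, for illustration; a real implementation would
    estimate and undo an arbitrary skew angle."""
    return np.rot90(img, round(angle_deg / 90) % 4)

def adjust_shape(img, target=(1024, 768)):
    """Normalize every page to one canvas size by padding with white
    (and cropping overflow); the target size is a hypothetical choice."""
    h, w = target
    canvas = np.full((h, w), 255, dtype=img.dtype)
    ch, cw = min(h, img.shape[0]), min(w, img.shape[1])
    canvas[:ch, :cw] = img[:ch, :cw]
    return canvas

def preprocess(img, tilt_deg=0):
    """Process I: background elimination, tilt correction, shape adjustment."""
    return adjust_shape(correct_tilt(eliminate_background(img), tilt_deg))
```

Feeding every document image through such a pipeline gives the later layout and character recognition steps a uniform, clean input.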
[0047] In Process II, a layout recognizing process 56 is performed. In the layout recognizing process 56, "layout recognition" of a document image is first performed. The layout recognizing process 56 is a process of recognizing at what position in an input image what information is located.
[0048] Information mentioned herein means things such as a character string, a table, an image, a seal, and handwriting. The electronic document generation device 10 recognizes a layout of a document image, performs "recognition of a table" when a table is contained in the document image, and performs "extraction of an image of a cell" on cells contained in the table.
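The output of the layout recognizing process, and the cell extraction that follows it for tables, can be sketched with a simple data structure. The `LayoutElement` shape and the `detect_cells` callback are hypothetical stand-ins for the layout model's output and the extraction unit; only the element type names come from the description above.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in page pixels

@dataclass
class LayoutElement:
    """One recognized element: its type (character_string, table,
    image, seal, or handwriting) and the position of its range."""
    type: str
    bbox: Box

def extract_table_cells(elements: List[LayoutElement],
                        detect_cells: Callable[[Box], List[Box]]) -> List[Box]:
    """For each element recognized as a table, run a cell detector
    (a hypothetical stand-in for the extraction unit) and collect the
    position of every cell in the document image."""
    cells: List[Box] = []
    for el in elements:
        if el.type == "table":
            cells.extend(detect_cells(el.bbox))
    return cells
```

Each extracted cell box is then cropped from the page image and handed to the character string recognizing process.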
[0049] In Process III, a character string recognizing process 57 is performed. The character string recognizing process 57 is a process of converting an image containing a character string to text data using the character string learning model 13 having learned correspondence between images and character strings contained in the images. The character string recognizing process 57 may include processes such as "arrangement of text data" and "removal of noise."
[0050] In the character string recognizing process 57, an image of a character string is converted to text data, and "arrangement of text data" and "removal of noise" are performed. "Arrangement of text data" means that, when a space is contained in an extracted image of a character string, the space is also recognized along with the character string and thus text data is arranged along with the space.
[0051] "Removal of noise" means that, when noise is contained in an extracted image of a character string, noise is not recognized by the electronic document generation device 10 and thus is passively removed from the text data. Noise mentioned herein means a pixel not constituting a character in the extracted image of a character string.
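Process III, together with the passive noise removal just described, can be sketched as follows. The element dicts, `crop`, and `recognize` are hypothetical stand-ins (for the layout output, image slicing, and the character string learning model respectively); the key idea shown is that the model maps a whole string image, spaces included, to text, so a noise-only crop simply yields nothing.

```python
def digitize_page(elements, crop, recognize):
    """Assemble output text with positions. `elements` is a list of
    {"type": str, "bbox": (x, y, w, h)} dicts from the layout step;
    `recognize` stands in for the learned model that converts a string
    image directly to text (not character by character)."""
    page = []
    for el in elements:
        if el["type"] == "character_string":
            text = recognize(crop(el["bbox"]))
            if text:  # a noise-only crop recognizes to "" and is dropped
                page.append({"bbox": el["bbox"], "text": text})
    return page
```

The output unit would then place each text entry at its recorded position to produce the electronic document.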
[0052] A functional configuration of the electronic document generation device 10
will be described below with reference to FIG. 4. FIG. 4 is a block diagram illustrating the
functional configuration of the electronic document generation device 10. The electronic
document generation device 10 includes a document image acquiring unit 31, a
preprocessing unit 32, a background eliminating unit 32a, a tilt correcting unit 32b, a shape
adjusting unit 32c, a layout recognizing unit 33, an extraction unit 34, a character string
recognizing unit 35, an output unit 36, a layout learning data generating unit 40, a layout
learning data correcting unit 41, a layout learning unit 42, a character string learning data
generating unit 43, a character string learning data correcting unit 44, and a character string
learning unit 45 by causing the CPU 25 to execute an electronic document generation
program which will be described later.
[0053] The document image acquiring unit 31 (see FIG. 4) acquires a document
image by imaging a document.
The document image acquiring unit 31 may acquire a document image from the
document image database 15. Alternatively, the document image acquiring unit 31 may
acquire a document image from the scanner of the input device 26.
[0054] A document image acquired by the document image acquiring unit 31 and
an electronic document output from the electronic document generation device 10 will be
described below with reference to FIG. 5.
FIG. 5 is a diagram illustrating input data and output data of the electronic document
generation device 10, where FIG. 5(a) illustrates a document image acquired as input data
by the document image acquiring unit 31. Noise such as a stapler mark 50, handwriting 51,
a seal 52, and an image 53 is present in the document image.
[0055] This noise either obstructs understanding of the details of the document by a person or by an information processing device such as a PC, or is simply unnecessary. Other examples of noise include a punched hole for filing and a fold remaining in a paper sheet. The fold may be recognized as a line and needs to be removed such that it is not reflected in the electronic document.
[0056] The electronic document generation device 10 converts a character string in the document image to text data while maintaining the layout of the acquired document image and outputs an electronic document (see FIG. 5(b)). The electronic document generation device 10 removes noise by performing active processing on the stapler mark 50, handwriting 51, seal 52, and image 53 recognized as noise, and removes pixels in the document image that are recognized as neither a character string nor noise through passive processing, that is, by not leaving those pixels in the electronic document.
[0057] A table in the document image illustrated in FIG. 5(b) is output as object data in the electronic document along with text data while maintaining the arrangement of
the document image. The electronic document generation device 10 can arbitrarily select
elements contained in the output electronic document.
[0058] For example, a stapler mark 50, handwriting 51, a seal 52, and an image 53 are removed in normal use, but the seal 52 and the image 53 may be included as image data
in the electronic document and then output.
[0059] The preprocessing unit 32 (see FIG. 4) performs the preprocessing 55 on the document image acquired by the document image acquiring unit 31.
The preprocessing 55 is performed to improve recognition accuracy of image
recognition using a learning model in the layout recognizing unit 33 and the character string
recognizing unit 35 which will be described later.
[0060] The preprocessing unit 32 includes the background eliminating unit 32a, the
tilt correcting unit 32b, and the shape adjusting unit 32c.
The background eliminating unit 32a (see FIG. 4) eliminates the background of the
document image acquired by the document image acquiring unit 31.
[0061] Processes which are performed by the background eliminating unit 32a will
be described below with reference to FIG. 6. FIG. 6 is a diagram illustrating background
elimination which is performed in the preprocessing 55. FIG. 6(a) illustrates a document
image 58a before background elimination has been performed thereon, and FIG. 6(b)
illustrates a document image 58b after background elimination has been performed thereon.
[0062] The background eliminating unit 32a eliminates the background of the
document image by changing the background color of the document image to white.
Specifically, the background eliminating unit 32a detects the background color of the acquired document image and determines whether the background color is white. When it is determined that the background color is not white, the background eliminating unit 32a extracts the information other than the background of the document image, changes the background color to white, and then superimposes the extracted information thereon.
[0063] With the background eliminating unit 32a, by eliminating the background, it is possible to remove noise that causes erroneous image recognition in the layout recognizing unit 33 and the character string recognizing unit 35 and to improve recognition accuracy.
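The detect-then-overlay steps described above could be sketched as follows. This is an illustrative outline only, not the claimed implementation: the function names, the use of the most frequent pixel value as the background estimate, and the "near white" cutoff of 250 are all assumptions.

```python
from collections import Counter

WHITE = 255

def estimate_background(gray):
    """Estimate the background as the most common pixel value in the image
    (gray is a grayscale image stored as a list of rows of 0-255 values)."""
    counts = Counter(v for row in gray for v in row)
    return counts.most_common(1)[0][0]

def eliminate_background(gray):
    """Force the background of a grayscale document image to white.

    Content pixels (darker than the estimated background) are kept as-is,
    which mirrors extracting the non-background information and
    superimposing it on a white page.
    """
    bg = estimate_background(gray)
    if bg >= 250:                       # background is already (near) white
        return [row[:] for row in gray]
    return [[v if v < bg else WHITE for v in row] for row in gray]
```

A real implementation would operate on image arrays rather than nested lists, but the control flow (detect, test for white, whiten) is the same.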
[0064] The tilt correcting unit 32b (see FIG. 4) corrects a tilt of the document image acquired by the document image acquiring unit 31. Processes which are performed by the tilt correcting unit 32b will be described below with reference to FIG. 7. FIG. 7 is a diagram illustrating tilt correction which is performed in the preprocessing 55. FIG. 7(a) illustrates a document image 59a before tilt correction has been performed thereon, and FIG. 7(b) illustrates a document image 59b after tilt correction has been performed thereon.
[0065] The tilt correcting unit 32b corrects a tilt of a character string when a tilted character string is contained in the document image such that the character string is parallel or perpendicular to a writing direction. The tilt correcting unit 32b corrects the tilted character string to be parallel to a vertical writing direction when the document image is in vertical writing, and corrects the tilted character string to be parallel to a horizontal writing direction when the document image is in horizontal writing.
[0066] Specifically, the tilt correcting unit 32b extracts character strings of a document image and determines whether a tilted character string is contained in the extracted character strings. When it is determined that a tilted character string is present in the extracted character strings, the tilt correcting unit 32b detects a tilt angle with respect to the writing direction of the tilted character string and performs a rotating process such that the tilt angle of the tilted character string is zero.
[0067] With the tilt correcting unit 32b, it is possible to improve recognition accuracy of image recognition in the character string recognizing unit 35 by correcting a tilt of a character string. It is also possible to reduce an error of layout recognition in the layout recognizing unit 33.
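The detect-angle-then-rotate steps above reduce to plane geometry. The sketch below assumes a horizontally written document whose baseline endpoints have already been extracted; the function names are hypothetical, and coordinates are treated as ordinary Cartesian points.

```python
import math

def baseline_angle(p0, p1):
    """Tilt angle (degrees) of a text baseline relative to the horizontal
    writing direction, given its two endpoints (x, y)."""
    (x0, y0), (x1, y1) = p0, p1
    return math.degrees(math.atan2(y1 - y0, x1 - x0))

def deskew_point(point, angle_deg, center=(0.0, 0.0)):
    """Rotate a point about `center` by the negative of the detected tilt
    angle, so that after rotation the tilt angle of the string is zero."""
    a = math.radians(-angle_deg)
    cx, cy = center
    x, y = point[0] - cx, point[1] - cy
    return (cx + x * math.cos(a) - y * math.sin(a),
            cy + x * math.sin(a) + y * math.cos(a))
```

Applying `deskew_point` to every pixel (or to the corners of each recognition range) with the angle returned by `baseline_angle` performs the rotating process described in paragraph [0066].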
[0068] The shape adjusting unit 32c (see FIG. 4) adjusts a shape and a size of the document image as a whole acquired by the document image acquiring unit 31.
Processes which are performed by the shape adjusting unit 32c will be described below
with reference to FIG. 8. FIG. 8 is a diagram illustrating shape adjustment which is
performed in the preprocessing. FIG. 8(a) illustrates a document image 60a before shape
adjustment has been performed thereon, and FIG. 8(b) illustrates a document image 60b after
shape adjustment has been performed thereon.
[0069] When the shape of the document image as a whole acquired by the document image acquiring unit 31 is different from that of an actual document, the shape
adjusting unit 32c adjusts a shape of the document image as a whole on the basis of the whole
shape of the actual document. Specifically, when an aspect ratio of the document image as
a whole acquired by the document image acquiring unit 31 is different from an aspect ratio
of the whole actual document, the shape adjusting unit 32c performs adjustment such that
the aspect ratio of the whole document image is equal to the aspect ratio of the whole actual
document.
[0070] When the size of the document image acquired by the document image
acquiring unit 31 is excessively large or excessively small, there is a likelihood that
subsequent processes will not be performed normally and thus the shape adjusting unit 32c
adjusts the size of the document image acquired by the document image acquiring unit 31
such that the subsequent processes are performed normally.
[0071] With the shape adjusting unit 32c, by adjusting the shape and the size of the
document image acquired by the document image acquiring unit 31, it is possible to improve
recognition accuracy of a layout based on an actual document in the layout recognizing unit
33 which is performed subsequently and to further improve recognition accuracy of image
recognition in the character string recognizing unit 35.
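The aspect-ratio and size adjustment described in paragraphs [0069] and [0070] amounts to simple arithmetic. The sketch below is illustrative only; the bounds `min_side` and `max_side`, standing in for "excessively small" and "excessively large", are assumed values, and the choice to correct the width while keeping the height is likewise an assumption.

```python
def adjust_shape(img_w, img_h, doc_w, doc_h, min_side=300, max_side=4000):
    """Compute target pixel dimensions for a scanned document image.

    First the width is corrected so that the aspect ratio of the whole
    image equals the aspect ratio of the whole actual document; then the
    image is uniformly rescaled so both sides fall inside
    [min_side, max_side], keeping subsequent processes stable.
    """
    # match the actual document's width/height ratio by adjusting the width
    target_w = img_h * doc_w / doc_h
    target_h = float(img_h)
    # uniformly rescale when the result is excessively large or small
    longest, shortest = max(target_w, target_h), min(target_w, target_h)
    scale = 1.0
    if longest > max_side:
        scale = max_side / longest
    elif shortest < min_side:
        scale = min_side / shortest
    return round(target_w * scale), round(target_h * scale)
```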
[0072] The layout recognizing unit 33 (see FIG. 4) identifies a range of each of a plurality of elements contained in the document image acquired by the document image acquiring unit 31 in the document image 61 using a layout learning model 14 having learned correspondence between a plurality of elements contained in the document image 61 and identification information of each of the plurality of elements, recognizes a type of each of the plurality of elements, and acquires position information of the ranges of the plurality of elements in the document image 61.
[0073] The type of each element may be one of a character string 48, a table 49, an image 53, a seal 52, and handwriting 51. The type of each element is not limited thereto, and a stapler mark 50, a punched hole mark, a fractured (torn) mark, and a copying carbon stain, for example, may be used.
[0074] Types appropriate for document types (for example, a contract, a bill, a memorandum, a delivery note, or a receipt) may be used as the type of each element. For example, when copying carbon is added to a rear side of a receipt and the carbon is transferred to a front side to form a stain, the copying carbon stain may be actively removed using the copying carbon stain as the type of an element.
[0075] The layout learning model 14 may be one of a layout learning model for a contract, a layout learning model for a bill, a layout learning model for a memorandum, a layout learning model for a delivery note, and a layout learning model for a receipt.
[0076] The types of elements may be classified into a necessary element and an unnecessary element according to the type of a document. In this case, the layout recognizing unit 33 may not acquire position information of the corresponding element when a recognized element out of the plurality of elements contained in the document image 61 corresponds to an unnecessary element, and may acquire position information of the corresponding element when the recognized element corresponds to a necessary element. Alternatively, the layout recognizing unit 33 may recognize only a necessary element out of the plurality of elements contained in the document image 61 and acquire position information of the corresponding element.
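The necessary/unnecessary classification per document type could be expressed as a lookup followed by a filter. The table below is purely illustrative: which element types count as necessary for a bill or a contract is not specified by the embodiment and is assumed here.

```python
# Hypothetical per-document-type sets of necessary element types.
NECESSARY = {
    "bill": {"character_string", "table"},
    "contract": {"character_string", "table", "seal"},
}

def filter_layout(doc_type, recognized):
    """Keep position information only for necessary elements.

    `recognized` is a list of dicts with at least a 'type' key; elements
    whose type is unnecessary for the given document type are dropped, so
    their position information is never acquired.
    """
    needed = NECESSARY.get(doc_type, set())
    return [e for e in recognized if e["type"] in needed]
```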
[0077] The layout recognizing unit 33 recognizes types of elements, acquires position information of the ranges of the elements in the document image, and then corrects the ranges of the elements and the acquired position information on the basis of an actual document when the elements overlap or the elements are excessively separated.
[0078] An example of a correction process for missing resolution which is performed by the layout recognizing unit 33 when a recognition range recognized by the layout recognizing unit 33 has missing will be described below with reference to FIG. 9. Missing means that a part of a range to be recognized as an element by the layout recognizing unit 33 is not recognized and a part of the range of an element is missed. FIG. 9 is a diagram illustrating a correction process for missing resolution which is performed in the layout recognizing process, where FIG. 9(a) illustrates a state before the correction has been performed and FIG. 9(b) illustrates a state after the correction has been performed.
[0079] When an image 70 of a character string contained in the document image acquired by the document image acquiring unit 31 is recognized as a character string, the layout recognizing unit 33 performs a correction process of determining whether the recognition range thereof has missing and performs a correction process of adding the missed part when there is missing.
[0080] FIG. 9(a) illustrates a state in which the layout recognizing unit 33 recognizes an image 70 of a character string as a character string in a recognition range 72a. The recognition range 72a has missing in the left end part of the image 70 of a character string. The layout recognizing unit 33 determines whether a black line is within a predetermined range around the recognition range 72a and performs correction of adding a range 72b including the black line to the recognition range 72a when there is a black line (see FIG. 9(b)).
[0081] The determination performed by the layout recognizing unit 33 is not limited to a black line, but whether a line with the same color as characters or a line with a preset color is within a predetermined range near the recognition range 72a may be determined. This is because the correction process for missing resolution performed in the layout recognizing process is mainly for improving recognition accuracy of a character recognizing process which is performed subsequently.
[0082] With this correction process, even when the range of an element recognized by the layout recognizing unit 33 has missing, it is possible to correct the recognition range to a normal recognition range by adding the missed range thereto, and the character string recognizing unit 35 can normally recognize a character string contained in the corresponding element.
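The missing-resolution correction (look for character-colored pixels within a predetermined range around the recognition range, and add that range when they are found) might be sketched as follows on a grayscale image stored as a list of rows. The `margin` and the darkness threshold are assumed values, and only horizontal growth is shown for brevity.

```python
def has_dark_pixel(img, x0, y0, x1, y1, dark=128):
    """True if any pixel darker than `dark` lies in the given region,
    clipped to the image bounds."""
    h, w = len(img), len(img[0])
    for y in range(max(0, y0), min(h, y1)):
        for x in range(max(0, x0), min(w, x1)):
            if img[y][x] < dark:
                return True
    return False

def resolve_missing(img, box, margin=3):
    """Grow a recognition box (x0, y0, x1, y1) sideways while
    character-colored (here: dark) pixels remain within `margin` of its
    edge, adding the missed part of the element back into the range."""
    x0, y0, x1, y1 = box
    while x0 > 0 and has_dark_pixel(img, x0 - margin, y0, x0, y1):
        x0 -= margin
    while x1 < len(img[0]) and has_dark_pixel(img, x1, y0, x1 + margin, y1):
        x1 += margin
    return (max(0, x0), y0, min(len(img[0]), x1), y1)
```

As noted in paragraph [0081], the test could equally check for a line with the same color as the characters rather than simply for dark pixels.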
[0083] An example of correction which is performed by the layout recognizing unit 33 when a recognition range recognized by the layout recognizing unit 33 overlaps another
element will be described below with reference to FIG. 10. FIG. 10 is a diagram illustrating
a correction process for overlap resolution which is performed in the layout recognizing
process, where FIG. 10(a) illustrates a state before the correction has been performed and
FIG. 10(b) illustrates a state after the correction has been performed.
[0084] When an image 73 of a character string contained in the document image acquired by the document image acquiring unit 31 is recognized as a character string, the
layout recognizing unit 33 determines whether a recognition range 75a thereof overlaps
another element (for example, a table 74) and performs a correction process for resolving an
overlap when the overlap occurs.
[0085] FIG. 10(a) illustrates a state in which the layout recognizing unit 33
recognizes the image 73 of the character string as a character string in the recognition range
75a. The recognition range 75a overlaps a table 74 on the right side of the image 73 of the
character string over a blank (space). The layout recognizing unit 33 determines whether a
blank (space) with a predetermined size is present in the recognition range 75a and performs
correction for obtaining a recognition range 75b by deleting the recognition range 75a
associated with the blank (space) and a part on the right side of the blank (space) when the
blank (space) is present (see FIG. 10(b)).
[0086] Since a blank (space) with a predetermined size is necessarily present
between an element and another element, the layout recognizing unit 33 determines that the
recognition range overlaps another element when the blank (space) with a predetermined
size is present in the recognition range. With the correction process for overlap resolution
which is performed in the layout recognizing process, it is possible to improve recognition accuracy of a layout in the layout recognizing unit 33.
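The overlap-resolution correction (find a blank of a predetermined size inside the recognition range, then delete the blank and the part on its right) could be sketched as a column scan. The whiteness threshold and `min_gap` width are assumed values standing in for the "predetermined size" of the blank.

```python
def blank_columns(img, box, white=200):
    """Column indices inside `box` that contain no content pixels."""
    x0, y0, x1, y1 = box
    return [x for x in range(x0, x1)
            if all(img[y][x] >= white for y in range(y0, y1))]

def resolve_overlap(img, box, min_gap=5):
    """Cut a recognition range at the first blank run at least `min_gap`
    columns wide: the blank and everything to its right are dropped,
    resolving an overlap with a neighboring element."""
    x0, y0, x1, y1 = box
    blanks = set(blank_columns(img, box))
    run = 0
    for x in range(x0, x1):
        run = run + 1 if x in blanks else 0
        if run >= min_gap:
            return (x0, y0, x - run + 1, y1)  # truncate before the blank
    return box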
[0087] Processes which are performed by the layout recognizing unit 33 will be described below with reference to FIG. 11. FIG. 11 is a diagram illustrating layout
recognition which is performed in the layout recognizing process, where FIG. 11(a)
illustrates a state of a document image 61 before the layout recognition has been performed
thereon and FIG. 11(b) illustrates a state of a document image after the layout recognition
has been performed thereon.
[0088] The layout recognizing unit 33 identifies ranges in the document image 61 of elements (a character string 48, a table 49, a seal 52, and an image 53) contained in the
document image 61 through image recognition using the layout learning model 14.
[0089] In FIG. 11(b), for the purpose of convenient explanation, the range of the
identified character string 48 is surrounded by a solid line, and the ranges of the identified
table 49, the identified seal 52, and the identified image 53 are surrounded by dotted lines.
Since a boundary of an element has only to be recognized by the electronic document
generation device 10, it may not be visible to a person.
[0090] The layout recognizing unit 33 recognizes types of the corresponding elements through image recognition using the layout learning model 14 in the identified
ranges in the document image 61 and acquires position information of the ranges in the
document image 62 along with the types of the elements. The position information may be
expressed by a two-dimensional orthogonal coordinate system with a predetermined point in
the document image 62 as an origin.
[0091] The layout learning model 14 is set in advance according to the type of the
document image 61, and the layout recognizing unit 33 recognizes a layout of the document
image 61 using the layout learning model 14 set in advance.
[0092] That is, the layout recognizing unit 33 performs image recognition using the
layout learning model 14 for a contract when the document image 61 acquired by the
document image acquiring unit 31 is a contract, performs image recognition using the layout
learning model 14 for a bill when the document image 61 is a bill, performs image
recognition using the layout learning model 14 for a memorandum when the document image
61 is a memorandum, performs image recognition using the layout learning model 14 for a
delivery note when the document image 61 is a delivery note, and performs image
recognition using the layout learning model 14 for a receipt when the document image 61 is
a receipt.
[0093] Since the layout recognizing unit 33 uses the layout learning model 14
according to the type of the document image 61 acquired by the document image acquiring
unit 31, it is possible to improve recognition accuracy of a layout of the document image 61.
[0094] The extraction unit 34 (see FIG. 4) extracts each cell of a table contained in an element whose type, as recognized by the layout recognizing unit 33, corresponds to a table and acquires position information of the cells in the document image.
[0095] Recognition of a table 49 which is performed by the layout recognizing unit 33 will be described below with reference to FIG. 12. FIG. 12 is a diagram illustrating
recognition of a table which is performed in the layout recognizing process 56, where FIG.
12(a) illustrates a table 63 before recognition has been performed thereon by the layout
recognizing unit 33 and FIG. 12(b) illustrates a table 64 after recognition has been performed
thereon by the layout recognizing unit 33. In FIG. 12(b), for the purpose of convenient
explanation, lines recognized as vertical lines 65 are denoted by one-dot chain lines, and
lines recognized as horizontal lines 66 are denoted by dotted lines.
[0096] The layout recognizing unit 33 recognizes a length and a position of each of
all the vertical lines 65 and the horizontal lines 66 constituting the table 64. The layout
recognizing unit 33 recognizes all the cells contained in the table 64 by recognizing the
lengths and the positions of all the vertical lines 65 and the horizontal lines 66 constituting
the table 64. That is, the layout recognizing unit 33 recognizes a rectangle constituted by
two neighboring vertical lines 65 and two neighboring horizontal lines 66 as a cell.
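The rule that a cell is the rectangle bounded by two neighboring vertical lines and two neighboring horizontal lines can be sketched directly once the line positions are known. The representation (line positions as x- and y-coordinates, cells keyed by (row, column)) is an assumption that anticipates the position information described for the extraction unit 34.

```python
from itertools import product

def cells_from_lines(vertical_xs, horizontal_ys):
    """Build every cell of a table from its recognized lines.

    Each cell is the rectangle (x0, y0, x1, y1) bounded by two neighboring
    vertical lines and two neighboring horizontal lines; the result maps
    (row, column) to that rectangle.
    """
    xs, ys = sorted(vertical_xs), sorted(horizontal_ys)
    cells = {}
    for row, col in product(range(len(ys) - 1), range(len(xs) - 1)):
        cells[(row, col)] = (xs[col], ys[row], xs[col + 1], ys[row + 1])
    return cells
```

A table bounded by three vertical and three horizontal lines thus yields a 2×2 grid of four cells.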
[0097] The layout recognizing unit 33 also recognizes line types of the lines
constituting the table 64. The recognized line types are reflected in objects of lines
constituting a table contained in an electronic document when the electronic document is
reproduced on the basis of the acquired document image. For example, when lines of a
table in the document image 62 are dotted lines, the lines of the table contained in the electronic document reproduced on the basis of the document image 62 are expressed as objects of dotted lines.
[0098] The extraction unit 34 extracts an image of each cell of all the cells contained in the table 64 recognized by the layout recognizing unit 33.
Extraction of a cell image which is performed by the extraction unit 34 will be described
below with reference to FIG. 13. FIG. 13 is a diagram illustrating extraction of a cell image.
A cell 67 extracted by the extraction unit 34 may include a plurality of character strings.
[0099] The extraction unit 34 acquires an image of each cell and position information of the cell in the table 64 for all the cells contained in the table 64. The position
information may be expressed by a two-dimensional orthogonal coordinate system with a
predetermined point in the table 64 as an origin or may be expressed by (row, column) in the
table 64.
[0100] The extraction unit 34 reproduces all the vertical lines and the horizontal lines constituting the table recognized by the layout recognizing unit 33 and generates
position information of all the cells.
[0101] A cell 67 containing a plurality of character strings will be described below
with reference to FIG. 14. FIG. 14 is a diagram illustrating character strings in a cell image.
[0102] When the extracted cell 67 contains a plurality of rows of character strings, the extraction unit 34 additionally extracts an image for each character string for all the
character strings. The cell 67 illustrated in FIG. 14 contains two rows of character strings,
and the extraction unit 34 extracts an image 67a of a character string and an image 67b of a
character string.
[0103] The character string recognizing unit 35 (see FIG. 4) recognizes a character
string contained in the document image acquired by the document image acquiring unit 31
using the character string learning model 13 having learned correspondence between
document images and character strings contained in the document images and generates text
data of the character string.
[0104] The character string recognizing unit 35 may recognize a character string
contained in a range recognized by the layout recognizing unit 33 using the character string learning model 13 and generate text data of the character string.
[0105] The character string recognizing unit 35 may recognize each of character strings contained in the cells extracted by the extraction unit 34 using the character string
learning model 13 and generate text data of the character strings.
[0106] The character string recognizing unit 35 may include a plurality of character
string learning models 13 and use a character string learning model 13 adapted to the
languages of the character strings contained in the plurality of elements.
The character string recognizing unit 35 uses a character string learning model
appropriate for recognition of character strings in English to recognize a document image
written in English, whereby it is possible to improve recognition accuracy.
[0107] Character recognition which is performed by the character string recognizing unit 35 will be described below with reference to FIGS. 15 and 16.
FIG. 15 is a diagram illustrating arrangement of text data which is performed in the
character string recognizing process 57, where FIG. 15(a) illustrates an image 67a of a
character string before the character recognition has been performed thereon and FIG. 15(b)
illustrates a character string 68a, that is, text data 68a, after the character recognition has
been performed thereon.
[0108] FIG. 16 is a diagram illustrating noise removal which is performed in the character string recognizing process 57, where FIG. 16(a) illustrates an image 71a of a
character string before the character recognition has been performed thereon and FIG. 16(b)
illustrates a character string 71b, that is, text data 71b, after the character recognition has
been performed thereon.
[0109] The image 67a of the character string illustrated in FIG. 15(a) contains a
handwritten check mark in addition to one row of character string. The character string
contains a blank between a word and a word. The character string recognizing unit 35
recognizes the whole image 67a of the character string using the character string learning
model 13 and generates text data.
[0110] The character string recognizing unit 35 recognizes two wordings "L/C NO:"
and "ILC18H000219" and a blank between the two wordings in the image 67a of the character string and generates text data corresponding to the two wordings and text data corresponding to the blank between the two wordings (68a: see FIG. 15(b)).
Accordingly, since the character string recognizing unit 35 recognizes a space between
wordings and converts the space to text data, it is possible to arrange the two wordings
separately similarly to the image 67a.
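The "arrangement of text data" just described (recognizing the blank between two wordings and converting it to text) could be sketched as follows. Representing each recognized wording as a (gap-in-pixels, text) pair and the assumed `char_width` used to turn a pixel gap into a space count are illustrative choices, not part of the embodiment.

```python
def arrange_tokens(tokens, char_width=10):
    """Join recognized wordings into one line of text, converting the
    pixel gap measured before each wording into a proportional number of
    space characters.

    `tokens` is a list of (gap_px, text) pairs in reading order; the gap
    of the first token is the indent from the left edge of the range.
    """
    parts = []
    for gap_px, text in tokens:
        parts.append(" " * round(gap_px / char_width))
        parts.append(text)
    return "".join(parts)
```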
[0111] Since the character string recognizing unit 35 does not recognize a
handwritten check mark and does not add the handwritten check mark to text data when
recognizing the image 67a of the character string, the handwritten check mark is deleted
from the electronic document which is output (68a: see FIG. 15(b)). Accordingly, noise
such as the handwritten check mark which is not recognized by the character string
recognizing unit 35 is passively removed from the electronic document.
[0112] In the image 71a of a character string illustrated in FIG. 16(a), a part of a
seal overlapping one row of character string remains as noise. The character string
recognizing unit 35 recognizes the whole image 71a of a character string using the character
string learning model 13 and generates text data.
[0113] The character string recognizing unit 35 recognizes the whole image 71a of
a character string and generates text data corresponding to the character string "authorized
to act on behalf of the" (71b: see FIG. 16(b)).
[0114] Since noise contained in the image 71a of a character string is not
recognized by the character string recognizing unit 35, the noise is passively removed from
the electronic document (71b: see FIG. 16(b)).
[0115] A document image and text data of a character string after character recognition has been performed on the document image have been described above with reference to FIGS. 15 and 16. The character string learning model 13 can learn, as training data, a plurality of pieces of data in which FIG. 15(a) and FIG. 15(b) are correlated with each other and a plurality of pieces of data in which FIG. 16(a) and FIG. 16(b) are correlated with each other, whereby it is possible to realize character recognition from an image using deep learning.
[0116] The character string recognizing unit 35 may acquire attribute data such as sizes and fonts of characters contained in character strings when the character strings contained in the images 67a and 71a are recognized using the character string learning model
13. The attribute data of characters is reflected as attribute data of text data output from the
output unit 36 which will be described later.
[0117] The output unit 36 (see FIG. 4) outputs text data as text of an electronic
medium.
The output unit 36 may output the position information of the ranges of a plurality of elements and the text data of the plurality of elements as text of an electronic medium.
[0118] An electronic medium is not limited to data electrically stored in a recording medium and may include data which is not stored in a recording medium but of which details
can be handled by an information processing device such as a PC.
Position information of an element may be expressed by a two-dimensional orthogonal
coordinate system with a predetermined point in the document image 62 as an origin.
[0119] Since the output unit 36 outputs text data of a plurality of elements on the basis of the position information of the elements, it is possible to remove noise, to convert a
character string in the document image 61 to text data, and to output an electronic document
while maintaining the layout of the acquired document image 61.
[0120] The output unit 36 may reflect attribute data of characters acquired by the character string recognizing unit 35 in text data and output the text data as an electronic
document. With the output unit 36, the electronic document generation device 10 can
reproduce attribute data such as sizes and fonts of characters contained in the document
image 61 as attribute data of text data contained in the electronic document to be output.
[0121] The layout learning data generating unit 40 (see FIG. 4) adds an annotation
associated with the type of each element in a document image containing a plurality of
elements to the corresponding element and accumulates a plurality of document images with
annotations to generate layout learning data.
[0122] The layout learning data is used for supervised learning of the layout
learning model 14.
Position information, in the document image, of the ranges of the plurality of elements contained in the document image may be added, along with annotations, to the document images accumulated in the layout learning data.
[0123] The layout learning data with annotations will be described below with
reference to FIGS. 17 to 23. FIGS. 17 to 23 are diagrams illustrating examples of layout
learning data with annotations.
[0124] The layout learning data generating unit 40 acquires a document image from
the document image database 15, adds annotations to the document image, and generates
layout learning data. When the layout learning data is generated, a user may manually
generate the layout learning data without using the layout learning data generating unit 40.
When a user manually generates the layout learning data, annotations can be added to a
document image acquired from the document image database 15 using the user terminal 12.
[0125] Layout learning data used for learning of the layout learning model 14 for a
bill will be described below with reference to FIGS. 17 and 18. Annotation symbols are
added to elements contained in a document image such that a character string, a table, an
image, a seal, an outer line, and noise which are the elements can be identified and classified
by the electronic document generation device 10.
[0126] An annotation symbol 76 of a character string is added to an element
associated with a character string, the character string is surrounded by a rectangular
enclosing line, and a tag "Text" is added as a mark to the enclosing line. A part surrounded
by the rectangular enclosing line is learned as a range of the element associated with the
character string in the document image by the layout learning model 14.
[0127] An annotation symbol 77 of a table is added to an element associated with
a table, a rectangular enclosing line is superimposed on the outer line of the table, and a tag
"Border Table" is added as a mark to the enclosing line. A part surrounded by the
rectangular enclosing line is learned as a range of the element associated with the table in
the document image by the layout learning model 14.
[0128] An annotation symbol 78 of an image is added to an element associated with
an image, an enclosing line indicating the annotation symbol is superimposed on a boundary
line of the image, and a tag "Image" is added as a mark to the enclosing line. It is assumed that the image includes a logo, a mark, a photograph, and an illustration. Apartsurrounded by the enclosing line is learned as a range of the element associated with the image in the document image by the layout learning model 14.
[0129] An annotation symbol 79 of a seal is added to an element associated with a seal, an enclosing line indicating the annotation symbol is superimposed on a boundary line
of the seal, and a tag "Hun" is added as a mark to the enclosing line. A part surrounded by
the enclosing line is learned as a range of the element associated with the seal in the document
image by the layout learning model 14.
[0130] An annotation symbol 80 of an outer line is added to an element associated with an outer line, an enclosing line is superimposed on a boundary line of the outer line,
and a tag "Border" is added as a mark to the enclosing line. Lengths and positions of four
segments constituting the enclosing line are learned by the layout learning model 14.
[0131] An annotation symbol 81 of noise is added to an element associated with noise, the noise is surrounded by a rectangular enclosing line, and a tag "Noise" is added as
a mark to the enclosing line. A part surrounded by the enclosing line is learned as a range
of the element associated with the noise in the document image by the layout learning model
14.
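As a hedged sketch of the annotation scheme described above (the embodiment does not specify a concrete data format, so the field names and record shape here are illustrative assumptions), each element can be represented as a rectangular enclosing line plus one of the described tags:

```python
# Hypothetical layout-annotation record: each element in a document image
# is marked by a rectangular enclosing line (x, y, width, height) and one
# of the tags described above. Field names are illustrative only.
LAYOUT_TAGS = {"Text", "Border Table", "Image", "Hun", "Border", "Noise"}

def make_annotation(tag, x, y, width, height):
    """Build one annotation of the layout learning data."""
    if tag not in LAYOUT_TAGS:
        raise ValueError(f"unknown layout tag: {tag}")
    return {"tag": tag, "box": (x, y, width, height)}

# Example: a character string annotated with the tag "Text".
ann = make_annotation("Text", 120, 40, 300, 18)
```

The layout learning model 14 would then be trained on pairs of document images and lists of such records, learning the range (the box) and the type (the tag) of each element.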
[0132] Layout learning data used for learning of recognition of a table will be described below with reference to FIG. 19. In a table contained in the document image
acquired from the document image database 15, a one-dot chain line which is an annotation
symbol 83 of a vertical line is superimposed on all the vertical lines constituting the table,
and a dotted line which is an annotation symbol 84 of a horizontal line is superimposed on
all the horizontal lines constituting the table.
[0133] The layout learning model 14 can learn the size of the table, the range of the
table, the position thereof, and information of all the cells of the table by recognizing all the
one-dot chain lines and all the dotted lines. Information of the cells includes the number of
cells in the table and positions of the cells in the table, and the position of a cell in the table
is expressed by (row, column) of the table.
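The derivation of cell information from the recognized vertical and horizontal lines can be sketched as follows; the coordinate representation and the helper name are illustrative assumptions, not part of the described device:

```python
def cells_from_lines(vertical_xs, horizontal_ys):
    """Given the x coordinates of all vertical lines and the y coordinates
    of all horizontal lines constituting a table, return each cell as
    ((row, column), (left, top, right, bottom)). A cell is the rectangle
    bounded by two neighboring vertical lines and two neighboring
    horizontal lines."""
    xs, ys = sorted(vertical_xs), sorted(horizontal_ys)
    cells = []
    for r in range(len(ys) - 1):
        for c in range(len(xs) - 1):
            cells.append(((r, c), (xs[c], ys[r], xs[c + 1], ys[r + 1])))
    return cells

# A table drawn with 3 vertical and 3 horizontal lines has 2 x 2 = 4 cells.
grid = cells_from_lines([0, 100, 200], [0, 30, 60])
```

This also yields the number of cells in the table (the length of the returned list) and the position (row, column) of each cell, matching the information of the cells described above.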
[0134] Layout learning data used for learning of recognition of a character string in a cell of a table will be described below with reference to FIGS. 20 and 21. FIG. 20 illustrates layout learning data in which one row of character string is contained in each cell.
FIG. 21 illustrates layout learning data for recognizing a table containing a cell containing
one row of character string, a cell containing two rows of character strings, and a cell
containing three rows of character strings.
[0135] As illustrated in FIGS. 20 and 21, the annotation symbol 76 of a character string is added to each character string regardless of the number of rows of character strings
contained in one cell, the character string is surrounded by a rectangular enclosing line, and
a tag "Text" is added as a mark to the enclosing line.
[0136] The layout learning model 14 learns the ranges of the annotation symbol 76
of the character strings and positions of the character strings in the table. The electronic
document generation device 10 can reproduce the table by outputting text data of the
character strings as an electronic document along with object data associated with all the
vertical lines and the horizontal lines constituting the table.
[0137] Layout learning data used for learning of recognition of a character string
in a cell of a table will be described below with reference to FIG. 22. The annotation
symbol 76 of a character string is added to each character string contained in a cell of the
table illustrated in FIG. 22, the character string is surrounded by a rectangular enclosing line,
and a tag "Text" is added as a mark to the enclosing line.
[0138] The layout learning model 14 learns the ranges of the annotation symbol 76
of the character strings and position information of the character strings in the document.
The electronic document generation device 10 can reproduce the table in the electronic
document by locating the text data of the character strings at positions in the document. The
electronic document generation device 10 can reproduce the table in the electronic document
by only outputting text data without reproducing all the vertical lines and the horizontal lines
constituting the table in the electronic document.
[0139] Layout learning data used for learning of recognition of a seal will be
described below with reference to FIG. 23. By using the layout learning data illustrated in FIG. 23, the layout learning model 14 can learn a range and a position of a seal on the basis of the element of the character string and a blank located below the character string, without using the element corresponding to the seal.
[0140] The layout learning data correcting unit 41 (see FIG. 4) corrects at least one of a type of each of a plurality of elements acquired by the layout recognizing unit 33 and position information of a range of each of the plurality of elements in the document image on the basis of an input and updates the layout learning data by adding the corrected data thereto.
[0141] A difference may occur between a document image 61 before image recognition has been performed thereon by the layout recognizing unit 33 and a document image 62 after image recognition has been performed thereon by the layout recognizing unit 33. Examples of such a case include a case in which a part of a character string is not recognized, a case in which an element to be recognized as an image is recognized as a seal, and a case in which a position of a table is shifted.
[0142] In this case, the layout learning data is updated by correcting the document image 62 after image recognition has been performed thereon by the layout recognizing unit 33 such that it matches the document image 61 before image recognition has been performed thereon by the layout recognizing unit 33 and adding the corrected data to the layout learning data.
[0143] The layout learning unit 42 (see FIG. 4) performs retraining of the layout learning model 14 using the layout learning data updated by the layout learning data correcting unit 41. When the layout learning model 14 is retrained, it is possible to improve recognition accuracy of a layout of a document image.
[0144] The character string learning data generating unit 43 (see FIG. 4) generates character string learning data used for supervised learning of the character string learning model 13. The character string learning data correcting unit 44 (see FIG. 4) updates the character string learning data by correcting text data generated by the character string recognizing unit 35 on the basis of an input and adding the corrected text data thereto.
[0145] The character string learning unit 45 (see FIG. 4) performs retraining of the character string learning model 13 using the character string learning data updated by the character string learning data correcting unit 44.
[0146] The character string learning data generating unit 43 generates character string learning data by acquiring a document image from the document image database 15 and adding annotations to the document image. When the character string learning data is generated, a user may manually generate the character string learning data without using the character string learning data generating unit 43. When a user manually generates the character string learning data, the user can add annotations to the document image acquired from the document image database 15 using the user terminal 12.
[0147] Character string learning data used for learning of the character string learning model 13 will be described below with reference to FIG. 24. FIG. 24 is a diagram illustrating an example of character string learning data with annotations. FIG. 24 illustrates an output screen of the character string learning data generating unit 43 which is displayed on the user terminal 12 or the output device 27 of the electronic document generation device 10.
[0148] The character string learning data generating unit 43 adds text data corresponding to each character string as an annotation 85 of the text data to the character string contained in the document image acquired from the document image database 15.
[0149] Instead of text data which is added as annotations, corresponding character codes may be added as the annotation 85 of the text data. When a character string contained in the document image includes a blank, the character string learning data generating unit 43 generates the character string learning data such that the text data corresponding to the character string similarly includes a blank.
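As a hedged sketch (the field names and the character-code option below are illustrative assumptions; the embodiment does not specify a concrete data format), one character string learning sample can be represented as an image of a character string paired with its annotation:

```python
def make_sample(string_image, text, as_character_codes=False):
    """Pair an image of a character string with its annotation 85.
    Instead of the text data, corresponding character codes may be added
    as the annotation. A blank contained in the character string is
    preserved in the annotation as-is."""
    annotation = [ord(ch) for ch in text] if as_character_codes else text
    return {"image": string_image, "annotation": annotation}

# The blank inside the character string survives in the annotation.
sample = make_sample("bill_0001_line_03.png", "Total amount  1,200 JPY")
```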
[0150] An electronic document generation method which is performed by the electronic document generation device 10 according to this embodiment will be described below with reference to FIG. 25 along with an electronic document generation program. FIG. 25 illustrates a flowchart of the electronic document generation program. The electronic document generation method is performed by the CPU 25 of the electronic document generation device 10 in accordance with the electronic document generation program.
[0151] The electronic document generation program causes the CPU 25 of the electronic document generation device 10 to embody various functions such as a document
image acquiring function, a preprocessing function, a layout recognizing function, an
extraction function, a character recognizing function, and an output function. These
functions are performed in the order illustrated in FIG. 25, but the order may be changed as appropriate. These functions are the same as in the aforementioned
description of the electronic document generation device 10, and thus detailed description
thereof will be omitted.
[0152] The document image acquiring function acquires a document image by imaging a document (S31: a document image acquiring step).
A format of the document image may be, for example, PDF, JPG, or GIF, and may
include a data format which can be processed as an image by the electronic document
generation device 10.
[0153] The preprocessing function performs preprocessing on the document image acquired by the document image acquiring function (S32: a preprocessing step).
The preprocessing function includes a background eliminating function, a tilt
correcting function, and a shape adjusting function, the background eliminating function
removes the background of the document image acquired by the document image acquiring
function, the tilt correcting function corrects a tilt of the document image acquired by the
document image acquiring function, and the shape adjusting function adjusts a shape and a
size of the document image as a whole acquired by the document image acquiring function.
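Two of the three preprocessing functions can be sketched with NumPy as follows; the binarization threshold, the fixed page size, and the function names are illustrative assumptions (tilt correction, which requires estimating a rotation angle, is omitted from this sketch):

```python
import numpy as np

def eliminate_background(image, threshold=128):
    """Background-eliminating sketch: binarize a grayscale page so that
    light background pixels become white (255) and ink stays black (0).
    The threshold value is an illustrative assumption."""
    return np.where(image >= threshold, 255, 0).astype(np.uint8)

def adjust_shape(image, height, width):
    """Shape-adjusting sketch: pad (with white) or crop the page to a
    fixed overall size expected by the later recognition steps."""
    canvas = np.full((height, width), 255, dtype=np.uint8)
    h = min(height, image.shape[0])
    w = min(width, image.shape[1])
    canvas[:h, :w] = image[:h, :w]
    return canvas

page = np.array([[200, 90], [30, 210]], dtype=np.uint8)
binarized = eliminate_background(page)   # light pixels -> 255, dark -> 0
resized = adjust_shape(binarized, 3, 3)  # padded to 3x3 with white
```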
[0154] The layout recognizing function identifies a range of each of a plurality of
elements contained in the document image acquired by the document image acquiring
function in the document image using a layout learning model 14 having learned
correspondence between a plurality of elements contained in document images and
identification information of the plurality of elements, recognizes a type of each of the
plurality of elements, and acquires position information of a range of each of the plurality of elements in the document image (S33: a layout recognizing step).
[0155] Types of elements can be classified into a necessary element and an unnecessary element according to the type of the document. In this case, the layout recognizing function may not acquire position information of a recognized element when the recognized element out of the plurality of elements contained in the document image acquired by the document image acquiring function corresponds to an unnecessary element and acquire position information of the recognized element when the recognized element corresponds to a necessary element. Alternatively, the layout recognizing function may recognize only necessary elements out of the plurality of elements contained in the document image 61 and acquire position information of the elements.
[0156] The layout recognizing function recognizes a type of each element, acquires position information of the range of each element in the document image, and then corrects the range of each element and the acquired position information on the basis of an actual document when the elements overlap each other or when the elements are excessively separated.
[0157] The layout recognizing function recognizes lengths and positions of all vertical lines and horizontal lines constituting a table. The layout recognizing function recognizes all the cells contained in the table by recognizing the lengths and the positions of all the vertical lines and the horizontal lines constituting the table. The layout recognizing function recognizes a rectangle constituted by two neighboring vertical lines and two neighboring horizontal lines as a cell.
[0158] The layout recognizing function also recognizes line types of lines constituting the table. The recognized line types are reflected in objects of lines constituting the table contained in an electronic document when the electronic document is reproduced on the basis of the acquired document image. Accordingly, for example, when a line of a table in the document image is a dotted line, the line of the table contained in the electronic document reproduced on the basis of the document image is expressed as an object of a dotted line.
[0159] The extraction function extracts each cell in a table contained in an element of which a type recognized by the layout recognizing function corresponds to a table and acquires position information of each cell in the document image (S34: an extraction step).
The extraction function reproduces all the vertical lines and the horizontal lines
constituting the table recognized by the layout recognizing function and generates position
information of all the cells.
[0160] A cell extracted by the extraction function may include a plurality of character strings. When a plurality of rows of character strings are contained in the
extracted cell, the extraction function extracts an image of each character string of all the
character strings.
[0161] An image of a character string recognized by the layout recognizing
function and an image of a character string extracted by the extraction function are sent to
the character recognizing function for each row.
[0162] The character recognizing function recognizes a character string contained in the document image acquired by the document image acquiring function using a character
string learning model having learned correspondence between document images and
character strings contained in the document image and generates text data of the character
string (S35: a character recognizing step).
[0163] The output function outputs the text data as text of an electronic medium
(Step S36: an output step).
The output function outputs the text data on the basis of the position information of the
character strings acquired by the layout recognizing function and the position information of
the cells acquired by the extraction unit in the document image and reproduces the text data
as text of the electronic medium.
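A minimal sketch of the output function's placement logic, assuming each recognized element carries its text data and the (left, top) coordinate of its range (this record shape is illustrative, not the device's actual format):

```python
def reproduce_text(elements):
    """Output-function sketch: place recognized character strings back in
    reading order using their position information (top coordinate first,
    then left coordinate) and join them into the text of the electronic
    medium."""
    ordered = sorted(elements, key=lambda e: (e["box"][1], e["box"][0]))
    return "\n".join(e["text"] for e in ordered)

elements = [
    {"text": "1,200 JPY", "box": (180, 60)},
    {"text": "Receipt",   "box": (40, 10)},
    {"text": "Total",     "box": (40, 60)},
]
# -> "Receipt\nTotal\n1,200 JPY"
```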
[0164] The electronic document generation program in an example in which a
document image of a receipt is converted to an electronic document will be described below
with reference to FIGS. 26 to 28. FIGS. 26 to 28 are flowcharts illustrating the electronic
document generation program according to the embodiment. The flowcharts illustrated in
FIGS. 26 to 28 are combined to form one flowchart of the electronic document generation
program.
[0165] In Step S102, the document image acquiring unit 31 acquires a document image or a PDF from the document image database 15.
In Step S103, it is determined whether data acquired by the document image acquiring
unit 31 is a PDF. When the acquired data is not a PDF (S103: NO), that is, when the data
acquired by the document image acquiring unit 31 is a document image, the process flow
proceeds to Step S106.
[0166] When the data acquired by the document image acquiring unit 31 is a PDF (S103: YES), the PDF is converted to a document image in Step S104 and then the document
image is acquired (S105).
[0167] In Step S106, the preprocessing unit 32 performs preprocessing on the
acquired document image. The preprocessing unit 32 includes the background eliminating
unit 32a, the tilt correcting unit 32b, and the shape adjusting unit 32c.
[0168] The background eliminating unit 32a eliminates the background of the acquired document image. When a character string contained in the acquired document
image is tilted, the tilt correcting unit 32b corrects the tilt of the character string by
performing tilt correction. The shape adjusting unit 32c adjusts the shape and the size of
the acquired document image as a whole.
[0169] In Step S107, the layout recognizing unit 33 acquires the document image subjected to the preprocessing performed by the preprocessing unit 32.
The acquired preprocessed document image is sent to a document image extracting
process of Steps S115, S120, and S136 which will be described later.
[0170] In Steps S108 and S109, the layout recognizing unit 33 performs layout recognition of the document image, identifies a range of each of a plurality of elements
contained in the document image, and acquires a type and position information of each
element.
The type of each element is a character string, a table, an image, a seal, or handwriting.
[0171] In Step S110, the layout recognizing unit 33 performs a process of adjusting
position information of a minimum boundary box of the acquired element.
The minimum boundary box means a rectangle with a minimum area out of rectangles surrounding the element and means a range occupied by the element. The layout recognizing unit 33 compares the document image with the acquired element and adjusts the position information of the minimum boundary box of the acquired element when there is displacement between the document image and the acquired position information of the element.
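The minimum boundary box described here can be computed from an element mask as follows; this NumPy sketch assumes the element's pixels are available as a binary mask, which is an illustrative assumption:

```python
import numpy as np

def minimum_boundary_box(mask):
    """Compute the minimum boundary box of an element: the rectangle
    (left, top, right, bottom) with a minimum area out of rectangles
    surrounding all nonzero pixels of the element mask."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

mask = np.zeros((5, 6), dtype=np.uint8)
mask[1:3, 2:5] = 1                  # element occupies rows 1-2, columns 2-4
box = minimum_boundary_box(mask)    # -> (2, 1, 4, 2)
```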
[0172] In Step S111, the layout recognizing unit 33 acquires layout information subjected to the minimum boundary box adjusting process performed in Step S110. The layout information includes types and position information of elements.
[0173] In Step S112, the layout recognizing unit 33 determines whether another element remains in the document image with reference to the internally stored layout information of the elements from the process of Step S130 which will be described later.
[0174] When the internally stored layout information of the elements from the process of Step S130 includes layout information of all the elements, the layout recognizing unit 33 determines that another element does not remain in the document image (S112: NO), performs a process of ending the loop from Step S112 to Step S130 in Step S131 and then performs the process of Step S132.
[0175] In contrast, when the internally stored layout information of the elements from the process of Step S130 does not include layout information of all the elements, the layout recognizing unit 33 determines that another element remains in the document image (S112: YES) and then performs the process of Step S113.
[0176] In Step S113, the layout recognizing unit 33 determines whether the element remaining in the document image is a table. When a table does not remain in the document image (S113: NO), the layout recognizing unit 33 sends layout information other than a table to Step S130 which will be described later.
[0177] When a table remains in the document image (S113: YES), the process flow proceeds to Step S114. The document image is associated with a receipt and thus often includes a table. Accordingly, when it is determined that the document image does not include a table, the layout recognizing unit 33 may stop the process flow and ascertain whether the electronic document is associated with a receipt.
[0178] In Step S114, the layout recognizing unit 33 acquires sizes and position information of all the vertical lines and the horizontal lines constituting the table in the
document image. When the sizes and the position information of all the vertical lines and
the horizontal lines constituting the table are acquired, the sizes and the positions of all the
cells contained in the table can be acquired.
[0179] In Step S115, the extraction unit 34 performs a process of extracting an image of the table from the preprocessed document image acquired in the process of Step
S107.
In Step S116, the extraction unit 34 acquires an image of the table extracted in Step
S115.
[0180] In Steps S117 and S118, the extraction unit 34 performs a process of extracting cells from the image of the table acquired in Step S116 (Step S117) and acquires
information of the cells (Step S118).
[0181] The information of a cell includes a row, a column, and coordinates
corresponding to the position information of the cell in the table.
The information of cells acquired in Step S118 is sent to Step S127 which will be
described later.
[0182] In Step S119, the extraction unit 34 determines whether another cell remains
in the table with reference to the internally stored layout information of the table from the
process of Step S127.
[0183] When the internally stored layout information of the table from the process
of Step S127 includes layout information of all the cells, the extraction unit 34 determines
that another cell does not remain in the table (S119: NO), a process of ending the loop from
Step S119 to Step S127 is performed in Step S128, and then the process flow proceeds to
Step S130.
[0184] In contrast, when the internally stored layout information of the cell from
the process of Step S127 does not include layout information of all the cells, the extraction unit 34 determines that another cell remains in the table (S119: YES), and the process flow proceeds to Step S120.
[0185] In Step S120, the extraction unit 34 performs a process of extracting images of cells from the preprocessed document image acquired in the process of Step S107. In Step S121, the extraction unit 34 acquires the images of the cells extracted in the process of Step S120.
[0186] In Step S122, the character string recognizing unit 35 performs a character string recognizing process on the images of the cells acquired in the process of Step S121. In Step S123, the character string recognizing unit 35 acquires position information of the character string subjected to the character string recognizing process.
[0187] In Step S124, the character string recognizing unit 35 performs a process of adjusting position information of the minimum boundary box of the character string acquired in the process of Step S123. The character string recognizing unit 35 compares the acquired position information of the character string with the document image and adjusts the acquired position information of the minimum boundary box of the character string when there is a difference between the document image and the acquired position information of the character string.
[0188] In Step S125, the character string recognizing unit 35 acquires position information after the process of adjusting the position information of the minimum boundary box of the character string performed in Step S124 has been performed thereon.
[0189] In Steps S126 and S127, the character string recognizing unit 35 combines the information of the cell acquired in the process of Step S118 and the adjusted position information of the character string acquired in the process of Step S125 (Step S127) and internally stores the combined information as layout information of the table in an internal storage device (Step S126). The internal storage device is one or both of the RAM 23 and the storage unit 24 illustrated in FIG. 2.
[0190] The processes of Steps S119 to S127 are performed on all the cells contained in the table. The process of ending the loop of Step S128 is performed after the processes of Steps S119 to S127 have been performed on a final cell in the table, and the character string recognizing unit 35 performs the process of Step S130.
[0191] In Steps S129 and S130, the output unit 36 combines the layout information of the table acquired in the process of Step S126 and the layout information other than the table acquired in the process of Step S113 (Step S130) and internally stores the combined information in the internal storage device as layout information of all the elements (Step S129).
[0192] The processes of Steps S112 to S130 are performed on all the elements contained in the document image. The process of ending the loop of Step S131 is performed after the processes of Steps S112 to S130 have been performed on a final element in the document image, and the character string recognizing unit 35 performs the process of Step S132.
[0193] In Step S132, the character string recognizing unit 35 determines whether another element remains in the document image. The character string recognizing unit 35 determines whether another element remains in the document image with reference to the internally stored layout information of the elements from the process of Step S140 which will be described later.
[0194] When the internally stored layout information of the elements from the process of Step S140 includes layout information of all the elements, the character string recognizing unit 35 determines that another element does not remain in the document image (S132: NO) and performs a process of ending the loop from Step S132 to Step S140 in Step S141, and the process flow proceeds to Step S142.
[0195] In contrast, when the internally stored layout information of the elements from the process of Step S140 does not include layout information of all the elements, the character string recognizing unit 35 determines that another element remains in the document image (S132: YES), and the process flow proceeds to Step S133.
[0196] In Step S133, the character string recognizing unit 35 determines whether an element remaining in the document image is a character string. When it is determined that the element remaining in the document image is a character string (S133: YES), the character string recognizing unit 35 proceeds to Step S135.
[0197] When the character string recognizing unit 35 determines that the element remaining in the document image is not a character string (S133: NO), a loop resuming
process proceeding to Step S132 is performed (Step S134).
In Step S135, the character string recognizing unit 35 acquires the position information
of the character string.
[0198] In Steps S136 and S137, the character string recognizing unit 35 extracts an image of the character string from the preprocessed document image acquired in the process
of Step S107 (Step S136) and acquires the image of the character string.
[0199] In Steps S138 and S139, the character string recognizing unit 35 performs a
process of Step S137 (Step S138) and generates text data predicted in the character string
recognizing process (Step S139).
[0200] In Step S140, the character string recognizing unit 35 combines the position information of the character string acquired in the process of Step S135 and the text data
generated in the process of Step S139 to generate layout information of the elements. The
generated layout information of the elements is sent to Step S129. In Step S129, the sent
layout information of the elements is internally stored in the internal storage device. The
internal storage device means one or both of the RAM 23 and the storage unit 24 illustrated
in FIG. 2.
[0201] The processes of Steps S132 to S140 are performed until it is determined in
Step S132 that the internally stored layout information of the elements from the process of
Step S140 includes layout information of all the elements.
[0202] When it is determined in Step S132 that the internally stored layout
information of the elements from the process of Step S140 includes layout information of all
the elements, a process of ending the loop from Step S132 to Step S140 is performed in Step
S141, and the process flow proceeds to Step S142.
[0203] In Step S142, the electronic document generation device 10 performs post-processing. In the post-processing, output to JavaScript Object Notation (JSON) and conversion to tab-separated values (TSV), for example, are performed on the text data, the images, and the position information of all the elements.
The aforementioned processes of the functional units are performed by the CPU 25 of
the electronic document generation device 10.
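The JSON output and TSV conversion of the post-processing can be sketched as follows, assuming a simple per-element record shape (the field names are illustrative, not the device's actual format):

```python
import json

def to_json(elements):
    """Serialize the layout information of all elements to JSON."""
    return json.dumps(elements, ensure_ascii=False)

def to_tsv(elements):
    """Convert the recognized elements to tab-separated values: one
    element per row with its type, text data, and position."""
    rows = ["type\ttext\tleft\ttop"]
    for e in elements:
        rows.append(f'{e["type"]}\t{e["text"]}\t{e["box"][0]}\t{e["box"][1]}')
    return "\n".join(rows)

elements = [{"type": "Text", "text": "Receipt", "box": (40, 10)}]
tsv = to_tsv(elements)
```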
[0204] In Step S143, the output unit 36 outputs information of all the elements subjected to the post-processing as a final electronic document in a plain text file, a hypertext markup language (HTML) format, a file format that can be edited by commercially available character editing software, or an editable PDF format.
[0205] According to the embodiment, the electronic document generation device 10 recognizes a layout of a document image using the layout learning model 14 and
recognizes characters of the document image using the character string learning model 13.
That is, since the electronic document generation device 10 identifies types of a plurality of
elements contained in the document image and performs character recognition appropriate
for the type of an element, it is possible to improve recognition accuracy of character
recognition.
[0206] According to the embodiment, since the electronic document generation device 10 performs character recognition of the document image for each character string using the character string learning model 13, rather than for each character as in the OCR text recognition technique according to the related art, it is possible to improve recognition efficiency at the time of character recognition.
[0207] According to the embodiment, since the electronic document generation
device 10 performs character recognition for each character string instead of character
recognition for each character at the time of character recognition, it is possible to perform
character recognition with a reduced influence of noise overlapping characters, for example,
and to improve recognition accuracy of character recognition in comparison with character
recognition for each character.
[0208] According to the embodiment, even characters which are likely to be
erroneously recognized in character recognition using the OCR text recognition technique
according to the related art can be correctly recognized in character recognition using the character string learning model 13. For example, when a seal overlaps a character, the character is likely to be erroneously recognized in character recognition using the OCR text recognition technique according to the related art, but can be correctly recognized in character recognition using the character string learning model 13.
[0209] According to the embodiment, since the electronic document generation
device 10 performs character recognition for each character string contained in an image of
each cell in an element of which the type corresponds to a table, it is possible to improve
recognition accuracy of character recognition of a character string contained in the table.
[0210] According to the embodiment, since the character string learning model 13 and the layout learning model 14 are trained using the character string learning data and the
layout learning data with annotations, it is possible to improve recognition accuracy of the
layout recognizing unit 33 and the character string recognizing unit 35.
[0211] According to the embodiment, when a document image includes a table, all the vertical lines and the horizontal lines constituting the table are first recognized, and then
all the cells in the table are recognized. Thereafter, since character string recognition is
performed on each of the cells independently, without being affected by position
information in the table, it is possible to improve recognition accuracy of character
recognition of a character string in a cell.
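The cell recognition step of paragraph [0211] can be sketched as follows (a hypothetical helper, not the device's actual implementation): once all vertical and horizontal lines are recognized, every cell is enumerated from adjacent pairs of lines, and each resulting box can then be passed to string recognition independently of its position in the table.

```python
def cells_from_lines(vertical_xs, horizontal_ys):
    """Return (left, top, right, bottom) boxes for every cell in the grid."""
    xs = sorted(vertical_xs)
    ys = sorted(horizontal_ys)
    boxes = []
    for top, bottom in zip(ys, ys[1:]):        # adjacent horizontal lines
        for left, right in zip(xs, xs[1:]):    # adjacent vertical lines
            boxes.append((left, top, right, bottom))
    return boxes

# 3 vertical and 3 horizontal lines bound a 2x2 grid of cells.
boxes = cells_from_lines([0, 100, 200], [0, 50, 100])
print(len(boxes))   # 4 cells
print(boxes[0])     # (0, 0, 100, 50)
```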
[0212] The present disclosure is not limited to the electronic document generation
device 10 according to the embodiment, and can be carried out as various modified examples
or application examples without departing from the gist of the present disclosure described
in the appended claims.
[Reference Signs List]
[0213]
10 Electronic document generation device
11 Information communication network
12 User terminal
13 Character string learning model
14 Layout learning model
15 Document image database
20 Input/output interface
21 Communication interface
22 ROM
23 RAM
24 Storage unit
25 CPU
26 Input device
27 Output device
28 GPU
31 Document image acquiring unit
32 Preprocessing unit
32a Background eliminating unit
32b Tilt correcting unit
32c Shape adjusting unit
33 Layout recognizing unit
34 Extraction unit
35 Character string recognizing unit
36 Output unit
40 Layout learning data generating unit
41 Layout learning data correcting unit
42 Layout learning unit
43 Character string learning data generating unit
44 Character string learning data correcting unit
45 Character string learning unit
47 Pre-tilt-correction document image
48 Character string
49 Table
50 Stapler mark
51 Handwriting
52 Seal
53 Image
54 Noise removal
55 Preprocessing
56 Layout recognizing process
57 Character string recognizing process
58a, 59a, 60a Document image
58b, 59b, 60b Document image
61, 62 Document image
63, 64 Table
65 Vertical line
66 Horizontal line
67 Cell image
69, 70, 73 Image of character string
71a Image of character string
71b Text data
72 Recognition range
73 Table
75 Recognition range
76 Annotation symbol of character string
77 Annotation symbol of table
78 Annotation symbol of image
79 Annotation symbol of seal
80 Annotation symbol of outer line
81 Annotation symbol of noise
82 Annotation symbol of handwriting
83 Annotation symbol of vertical line
84 Annotation symbol of horizontal line
85 Annotation of text data
100 Electronic document generation system
S31 Document image acquiring step
S32 Preprocessing step
S33 Layout recognizing step
S34 Extraction step
S35 Character string recognizing step
S36 Output step

Claims (16)

1. An electronic document generation device comprising:
a document image acquiring unit configured to acquire a document image
obtained by imaging a document;
a character string recognizing unit configured to recognize a character string
contained in the document image acquired by the document image acquiring unit using a
character string learning model having learned correspondence between document images
and character strings contained in the document images and to generate text data of the
character string; and
an output unit configured to output the text data as text of an electronic medium.
2. The electronic document generation device according to claim 1, further
comprising a layout recognizing unit configured to identify a range of each of a plurality of
elements contained in the document image acquired by the document image acquiring unit
in the document image using a layout learning model having learned correspondence
between a plurality of elements contained in document images and identification
information of the plurality of elements, to recognize a type of each of the plurality of
elements, and to acquire position information of each of the plurality of elements in the
document image associated with the range,
wherein the character string recognizing unit recognizes a character string
contained in the range recognized by the layout recognizing unit using the character string
learning model and generates text data of the character string, and
wherein the output unit outputs the text data associated with the plurality of
elements in the position information of the range associated with the plurality of elements
as text of an electronic medium.
3. The electronic document generation device according to claim 2, wherein the
type of each element is one of a character string, a table, an image, a seal, and handwriting.
4. The electronic document generation device according to claim 3, further
comprising an extraction unit configured to extract each of cells in a table included in an
element of which the type recognized by the layout recognizing unit corresponds to a table
and to acquire position information of each cell in the document image,
wherein the character string recognizing unit recognizes a character string
included in each cell extracted by the extraction unit using the character string learning
model and generates text data of the character string.
5. The electronic document generation device according to any one of claims 2 to
4, wherein annotations associated with the types corresponding to the elements are given to
the elements in the document image including the plurality of elements,
wherein the electronic document generation device further comprises a layout
learning data generating unit configured to accumulate a plurality of the document images
to which the annotations are given to generate layout learning data, and
wherein the layout learning data is used for supervised learning of the layout
learning model.
6. The electronic document generation device according to claim 5, wherein
position information of ranges associated with the plurality of elements included in the
document image in the document image along with the annotations is given to the
document image.
7. The electronic document generation device according to claim 5 or 6, further
comprising a layout learning data correcting unit configured to correct at least one of the
types of the plurality of elements recognized by the layout recognizing unit and the
position information of the ranges of the plurality of elements in the document image on
the basis of an input and to update the layout learning data by adding the corrected data.
8. The electronic document generation device according to claim 7, further
comprising a layout learning unit configured to perform re-learning of the layout learning
model using the layout learning data updated by the layout learning data correcting unit.
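The correct-and-retrain loop of claims 7 and 8 (and, analogously, claims 10 and 11) can be sketched schematically; the class and method names below are hypothetical, not taken from the specification. Corrections entered for recognized output are folded back into the learning data, and the updated data set is what a subsequent re-learning pass would consume.

```python
class LearningDataStore:
    def __init__(self):
        self.samples = []          # (input id, label) pairs for supervised learning

    def add_initial(self, image_id, predicted_label):
        # Accumulate the recognizer's output as provisional learning data.
        self.samples.append((image_id, predicted_label))

    def correct(self, image_id, corrected_label):
        # Replace the label for a sample on the basis of an operator's input;
        # the corrected pair becomes part of the updated learning data.
        self.samples = [
            (img, corrected_label if img == image_id else label)
            for img, label in self.samples
        ]

store = LearningDataStore()
store.add_initial("doc-001.png", "1nvoice")   # mis-recognized text
store.correct("doc-001.png", "invoice")       # operator's correction
print(store.samples)  # data set that re-learning would use
```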
9. The electronic document generation device according to any one of claims 2 to
8, further comprising a character string learning data generating unit configured to generate
character string learning data which is used for supervised learning of the character string
learning model.
10. The electronic document generation device according to claim 9, further
comprising a character string learning data correcting unit configured to correct text data
generated by the character string recognizing unit on the basis of an input and to update the
character string learning data by adding the corrected text data.
11. The electronic document generation device according to claim 10, further
comprising a character string learning unit configured to perform re-learning of the
character string learning model using the character string learning data updated by the
character string learning data correcting unit.
12. The electronic document generation device according to any one of claims 2
to 11, wherein the character string recognizing unit includes a plurality of the character
string learning models and uses the character string learning models adapted to languages
of the character strings included in the plurality of elements.
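Claim 12's dispatch among language-adapted models can be illustrated with a minimal sketch (all interfaces below are hypothetical): the character string recognizing unit holds several string learning models and selects the one matching the language of each element.

```python
class StringModel:
    def __init__(self, language):
        self.language = language

    def recognize(self, image):
        # Placeholder for inference by a language-specific string model.
        return f"<{self.language} text from {image}>"

class CharacterStringRecognizingUnit:
    def __init__(self, models):
        # Index the available models by the language they are adapted to.
        self.models = {m.language: m for m in models}

    def recognize(self, image, language):
        return self.models[language].recognize(image)

unit = CharacterStringRecognizingUnit([StringModel("ja"), StringModel("en")])
print(unit.recognize("cell_3.png", "en"))
```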
13. The electronic document generation device according to any one of claims 2
to 12, further comprising a preprocessing unit configured to perform preprocessing on the
document image acquired by the document image acquiring unit,
wherein the preprocessing unit includes a background eliminating unit, a tilt
correcting unit, and a shape adjusting unit,
wherein the background eliminating unit eliminates a background of the document image acquired by the document image acquiring unit,
wherein the tilt correcting unit corrects a tilt of the document image acquired by the document image acquiring unit, and
wherein the shape adjusting unit adjusts a shape and a size of the document image as a whole acquired by the document image acquiring unit.
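The preprocessing chain of claim 13 can be sketched as a simple sequential pipeline (the functions below are hypothetical placeholders; real implementations would operate on pixel arrays): background elimination, then tilt correction, then adjustment of the shape and size of the document image as a whole.

```python
def eliminate_background(doc):
    # Placeholder for removing the background from the document image.
    return {**doc, "background": None}

def correct_tilt(doc):
    # Placeholder for deskewing the document image.
    return {**doc, "tilt_deg": 0.0}

def adjust_shape(doc, width=2480, height=3508):   # e.g. A4 at 300 dpi
    # Placeholder for normalizing the overall shape and size.
    return {**doc, "size": (width, height)}

def preprocess(doc):
    # Apply the three preprocessing steps in sequence.
    for step in (eliminate_background, correct_tilt, adjust_shape):
        doc = step(doc)
    return doc

raw = {"background": "gray", "tilt_deg": 2.3, "size": (2400, 3300)}
print(preprocess(raw))
```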
14. The electronic document generation device according to any one of claims 2
to 13, wherein the layout learning model is one of a layout learning model for a contract, a
layout learning model for a bill, a layout learning model for a memorandum, a layout
learning model for a delivery note, and a layout learning model for a receipt.
15. An electronic document generation method that is performed by a computer
used for an electronic document generation device, the electronic document generation
method comprising:
a document image acquiring step of acquiring a document image obtained by
imaging a document;
a character string recognizing step of recognizing a character string contained in
the document image acquired in the document image acquiring step using a character string
learning model having learned correspondence between document images and character
strings contained in the document images and generating text data of the character string;
and
an output step of outputting the text data as text of an electronic medium.
16. An electronic document generation program causing a computer used for an
electronic document generation device to perform:
a document image acquiring function of acquiring a document image obtained by
imaging a document;
a character string recognizing function of recognizing a character string contained in the document image acquired by the document image acquiring function using a character string learning model having learned correspondence between document images and character strings contained in the document images and of generating text data of the character string; and
an output function of outputting the text data as text of an electronic medium.

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
JP2020-219612 2020-12-28
JP2020219612A JP7150809B2 (en) 2020-12-28 2020-12-28 Document digitization architecture by multi-model deep learning, document image processing program
PCT/JP2021/047935 WO2022145343A1 (en) 2020-12-28 2021-12-23 Architecture for digitalizing documents using multi-model deep learning, and document image processing program

Publications (2)

Publication Number Publication Date
AU2021412659A1 true AU2021412659A1 (en) 2023-07-13
AU2021412659A9 AU2021412659A9 (en) 2024-02-08

Family

ID=82259389

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2021412659A Pending AU2021412659A1 (en) 2020-12-28 2021-12-23 Architecture for digitalizing documents using multi-model deep learning, and document image processing program

Country Status (3)

Country Link
JP (2) JP7150809B2 (en)
AU (1) AU2021412659A1 (en)
WO (1) WO2022145343A1 (en)

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH045779A (en) * 1990-04-24 1992-01-09 Oki Electric Ind Co Ltd Character recognizing device
JP3162158B2 (en) * 1992-02-20 2001-04-25 株式会社リコー Image reading device
JPH07160809A (en) * 1993-12-08 1995-06-23 Nec Corp Ocr device
JPH08263588A (en) * 1995-03-28 1996-10-11 Fuji Xerox Co Ltd Character recognition device
JP3940491B2 (en) * 1998-02-27 2007-07-04 株式会社東芝 Document processing apparatus and document processing method
JP2004178010A (en) * 2002-11-22 2004-06-24 Toshiba Corp Document processor, its method, and program
JP2020046860A (en) * 2018-09-18 2020-03-26 株式会社三菱Ufj銀行 Form reading apparatus
JP6590355B1 (en) * 2019-04-26 2019-10-16 Arithmer株式会社 Learning model generation device, character recognition device, learning model generation method, character recognition method, and program

Also Published As

Publication number Publication date
JP2022169754A (en) 2022-11-09
WO2022145343A1 (en) 2022-07-07
AU2021412659A9 (en) 2024-02-08
JP7150809B2 (en) 2022-10-11
JP2022104411A (en) 2022-07-08


Legal Events

Date Code Title Description
SREP Specification republished