WO2021194075A1

WO2021194075A1 - Document recognition method and device therefor

Info

Publication number: WO2021194075A1
Application number: PCT/KR2021/000860
Authority: WO
Inventors: 서동완; 이진곤; 황진하
Original assignee: 주식회사 신한디에스
Priority date: 2020-03-23
Filing date: 2021-01-22
Publication date: 2021-09-30
Also published as: KR102256667B1

Abstract

Disclosed are a document recognition method and a device therefor. The document recognition device: defines one or more reference extraction areas and multiple reference feature areas in a reference document; detects multiple feature areas from a digital document; generates an extraction area obtained by changing the position, the slope, the size, or the shape of a reference extraction area on the basis of differences between coordinates of the reference feature areas and those of the feature areas detected from the digital document; and recognizes and stores information located in the extraction area in the digital document.

Description

Document recognition method and device

An embodiment of the present invention relates to a method for recognizing a document and an apparatus therefor, and more particularly, to a method and apparatus for recognizing information written in a document such as paper to make computer-readable data.

With the recent development of computer-related technologies, most of the work of each company is going digital. Also, with the development of technologies such as big data and artificial intelligence, interest is growing in how to convert the information stored in writing in the past into computer-readable data.

Conventionally, paper documents are scanned, converted into digital documents, and stored. However, because various deformations (e.g., warping, shifting, shrinking) occur while a document is digitized, the recognition accuracy of information in the document is lowered.

An aspect of the present invention is to provide a method and an apparatus for accurately recognizing information in a digital document even when deformation occurs in the process of digitizing the document.

In order to achieve the above technical object, an example of a document recognition method according to an embodiment of the present invention includes the steps of defining at least one reference feature area and at least one reference extraction area in a reference document; detecting at least one feature area in the digital document; generating an extraction region in which the position, slope, size or shape of the reference extraction region is changed based on a difference in coordinates between the reference feature region and the feature region detected in the digital document; and recognizing and storing information located in the extraction area in the digital document.

In order to achieve the above technical object, an example of a document recognition apparatus according to an embodiment of the present invention includes: a reference setting unit defining at least one reference feature area and at least one reference extraction area in a reference document; a feature detection unit for detecting at least one feature region in a digital document; an extraction region determining unit generating an extraction region in which a position, inclination, size or shape of the reference extraction region is changed based on a difference in coordinates between the reference feature region and the feature region detected in the digital document; and an information recognition unit for recognizing and storing information located in the extraction area in the digital document.

According to an embodiment of the present invention, even if deformation occurs in the process of digitizing the document, it is possible to accurately recognize the information of the digital document. In addition, it is possible to selectively recognize only desired information by designating a partial area rather than the entire document, thereby providing convenience and efficiency to the user.

FIG. 1 is a view showing the concept of a document recognition method according to an embodiment of the present invention;

FIG. 2 is a view showing an example of setting a reference feature area and a reference extraction area according to an embodiment of the present invention;

FIG. 3 is a view showing an example of setting a reference feature area, etc. according to an embodiment of the present invention in an actual document;

FIG. 4 is a flowchart illustrating an example of a method for setting a reference feature area according to an embodiment of the present invention;

FIGS. 5 to 7 are views showing examples of various methods of generating an extraction area of a digital document according to an embodiment of the present invention;

FIG. 8 is a diagram showing a reference extraction area, defined in a reference document according to an embodiment of the present invention, simply superimposed on a digital document deformed in the scanning process;

FIG. 9 is a view showing the reference extraction area according to an embodiment of the present invention modified to fit the digital document and superimposed on the digital document;

FIG. 10 is a flowchart illustrating an example of a document recognition method according to an embodiment of the present invention;

FIG. 11 is a diagram showing the configuration of an example of a document recognition apparatus according to an embodiment of the present invention.

Hereinafter, a document recognition method and an apparatus according to an embodiment of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram illustrating the concept of a document recognition method according to an embodiment of the present invention.

Referring to FIG. 1, a document 100 such as paper is scanned through a scanner or the like and converted into a digital document 110. Here, the digital document 110 refers to an electronic document (e.g., an image file) that can be viewed on a computer. Various conventional methods for converting a paper document 100 into a digital document 110 may be applied to this embodiment.

The document recognition device converts information located in all or part of the digital document 110 into computer-readable data and stores it. For example, when the digital document 110 is an image file, the document recognition device extracts a partial area of the digital document 110, converts the characters or numbers existing in the extracted area into computer-readable data, and stores the data.

FIG. 2 is a diagram illustrating an example of setting a reference feature area and a reference extraction area according to an embodiment of the present invention. FIG. 3 is a diagram illustrating an example of setting a reference feature area according to an embodiment of the present invention in an actual document.

Referring to FIG. 2, at least one reference feature area 210, 212, 214, 216 and at least one reference extraction area 220 are set in the reference document 200. Here, the reference document 200 refers to a digital document used to set the reference feature areas 210, 212, 214, and 216 and the reference extraction area 220. The reference document 200 may be a digital document stored as an electronic file (e.g., an 'MS-Word' file) or an image obtained by scanning and digitizing a paper document.

The reference feature regions 210, 212, 214, and 216 are regions including at least one feature among letters, numbers, logos, pictures, patterns, or combinations thereof existing in the basic form of the reference document 200. For example, as shown in FIG. 3, areas including strings such as 'MEDITERANEAN', 'LADING', 'PARTICULARS', and 'Measurement' existing in the basic form of the reference document 300 may be set as the reference feature areas 310, 312, 314, and 316.

The features included in each of the reference feature regions 210, 212, 214, and 216 may be different from each other. For example, if an 'abc' character string that exists in several places of the basic form of the reference document 200 is defined as the feature constituting a reference feature region, a plurality of reference feature regions at different positions may be detected. For this reason, a feature that appears only once in the reference document 200 can be used for each of the reference feature regions 210, 212, 214, and 216.

As another embodiment, if the same feature exists in various positions of the reference document 200, reference feature regions composed of the same feature may be defined. For example, the features included in the first to fourth reference feature regions 210, 212, 214, and 216 may all be the same 'abc' character string.

Although this embodiment shows four reference feature areas 210, 212, 214, and 216 located at the four corners so that the deformation of the digital document relative to the reference document can be easily identified, this is only an example. The number, location, and other properties of the reference feature areas 210, 212, 214, and 216 can be variously modified according to an embodiment. For example, only the first reference feature region 210 may exist. In this case, the deformation of the digital document relative to the reference document can be determined using the coordinates (a1, b1, c1, d1) of the four corners of the first reference feature area.

The reference extraction area 220 is an area that defines where the information to be extracted from a digital document is located. Although this embodiment shows one reference extraction region 220 for convenience of explanation, the number and position of reference extraction regions may be variously modified according to embodiments.

When the reference feature regions 210, 212, 214, 216 and the reference extraction region 220 are set in the reference document 200, the document recognition apparatus identifies and stores the coordinates of the corresponding regions. The document recognition apparatus may identify and store at least one coordinate (e.g., at least one of the corner coordinates of a rectangle, or at least one coordinate within a rectangular region) for each reference feature region 210, 212, 214, 216 and each reference extraction region 220.
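The stored coordinates described above can be pictured with a small data structure. The following is only an illustrative sketch: the names `Region` and `ReferenceTemplate`, the corner ordering, and the sample pixel values are assumptions introduced here and do not appear in the original disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Point = Tuple[float, float]


@dataclass
class Region:
    """A rectangular region stored by its four corner coordinates.

    Assumed corner order: top-left, top-right, bottom-right, bottom-left.
    """
    corners: List[Point]

    @classmethod
    def from_box(cls, left: float, top: float, right: float, bottom: float) -> "Region":
        return cls([(left, top), (right, top), (right, bottom), (left, bottom)])

    def center(self) -> Point:
        xs = [p[0] for p in self.corners]
        ys = [p[1] for p in self.corners]
        return (sum(xs) / len(xs), sum(ys) / len(ys))


@dataclass
class ReferenceTemplate:
    """Hypothetical container for what is stored for a reference document."""
    feature_regions: Dict[str, Region] = field(default_factory=dict)
    extraction_regions: Dict[str, Region] = field(default_factory=dict)


# Illustrative values only; the keys reuse the reference numerals of FIG. 2.
template = ReferenceTemplate()
template.feature_regions["210"] = Region.from_box(10, 10, 110, 40)
template.extraction_regions["220"] = Region.from_box(200, 300, 500, 340)
```

Storing all four corners, rather than only a width and height, keeps the same structure usable later for regions that have been rotated or distorted.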

The document recognition apparatus may provide a user interface through which the user can set the reference feature areas 210, 212, 214, 216 and the reference extraction area 220 in the reference document 200. For example, the document recognition apparatus may display the reference document 200 on the screen and provide a tool for the user to select a specific area. The user may select the reference feature regions 210, 212, 214, 216 and the reference extraction region 220 from the reference document 200 displayed on the screen through an input device such as a mouse or a touch screen. In another embodiment, the reference feature area may be automatically defined by the document recognition device, which will be described with reference to FIG. 4.

FIG. 4 is a flowchart illustrating an example of a method for setting a reference feature region according to an embodiment of the present invention.

Referring to FIGS. 2 and 4 together, the document recognition apparatus identifies features composed of letters, numbers, logos, pictures, patterns, or combinations thereof existing in the basic form of the reference document (S400). For example, the document recognition apparatus may use deep learning to identify a feature (i.e., a portion corresponding to the basic form) that is common to a plurality of digital documents generated by scanning. In another embodiment, the document recognition apparatus may recognize a predefined feature (e.g., a barcode, a QR code, a cross-shaped alignment image, etc.) in the reference document. In another embodiment, the document recognition apparatus may limit the area in which features are recognized to the corner portions of the reference document.

The document recognition apparatus sets at least one reference feature area including the identified feature (S410). The document recognition apparatus then defines at least one reference extraction area (S420). The reference extraction area may be input by the user through the user interface. The document recognition apparatus identifies and stores at least one coordinate for each of the reference feature area and the reference extraction area.

FIGS. 5 to 7 are diagrams illustrating examples of various methods of generating an extraction area of a digital document according to an embodiment of the present invention.

Referring to FIGS. 5 to 7, the document recognition apparatus detects the feature areas 510 to 516, 610 to 616, and 710 to 716 in the digital documents 500, 600, and 700. For example, when the reference characteristic information is defined as shown in FIG. 3, the document recognition apparatus looks for areas in the digital documents 500, 600, and 700 in which characteristic information such as 'MEDITERANEAN', 'LADING', 'PARTICULARS', and 'Measurement' exists. By searching only a certain area of the digital document 500, 600, 700 corresponding to the coordinate values of the reference characteristic areas 210, 212, 214, and 216, the document recognition apparatus can quickly detect the characteristic areas 510 to 516, 610 to 616, and 710 to 716 in which the characteristic information exists.
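One way to restrict the search to the neighborhood of a reference feature area is to enlarge the stored box by a margin and clip it to the page. This is a minimal sketch under stated assumptions: the function name `search_window`, the `(left, top, right, bottom)` box layout, and the fixed pixel margin are illustrative choices, not details given in the disclosure.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def search_window(ref_box: Box, margin: float, page_width: float, page_height: float) -> Box:
    """Return the area of the digital document to search for one feature.

    The reference feature box is grown by `margin` on every side so that a
    shifted or slightly scaled feature still falls inside the window, then
    clipped to the page boundaries.
    """
    left, top, right, bottom = ref_box
    return (
        max(0, left - margin),
        max(0, top - margin),
        min(page_width, right + margin),
        min(page_height, bottom + margin),
    )


# Example: a feature stored at (10, 10)-(110, 40) on a 1000 x 1400 page.
window = search_window((10, 10, 110, 40), margin=30, page_width=1000, page_height=1400)
```

A larger margin tolerates larger deformations at the cost of a slower search, so the margin would in practice be tuned to the scanner in use.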

The document recognition device compares the coordinates of the reference feature regions 210, 212, 214, and 216 defined in the reference document 200 with those of the feature regions 510 to 516, 610 to 616, and 710 to 716 detected in the digital documents 500, 600, and 700 to determine the deformation between the two documents, and reflects that deformation to generate the extraction areas 520, 620, and 720, in which the position, inclination, size, or shape of the reference extraction area 220 defined in the reference document 200 is changed. For example, the document recognition device compares the coordinates of the reference feature regions 210, 212, 214, and 216 with the coordinates of the feature regions 510 to 516, 610 to 616, and 710 to 716, identifies the movement, rotation, reduction, enlargement, or distortion of the digital document 500, 600, 700 relative to the reference document 200, and reflects it in the coordinates of the reference extraction area 220 to generate the extraction areas 520, 620, and 720 from which the information of the digital document 500, 600, 700 is extracted.

The extraction areas 520, 620, and 720, from which information is extracted in the digital document, can be determined by identifying where in the digital document the coordinates of the four corners (a5, b5, c5, d5) of the reference extraction area 220 of the reference document 200 are mapped. Since the digital document can be regarded as a deformed version of the reference document, once the deformation of the digital document relative to the reference document is known, various conventional methods for finding the mapping of coordinate values under a transformation between images can be applied to obtain the transformed coordinates of the reference extraction area.
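The disclosure leaves the choice of mapping method open. As one possible instance of such a conventional method, the sketch below fits a projective transform (homography) to four point correspondences and applies it to a corner of the reference extraction area. It assumes one representative point per feature area (for example its center) and no three points on one line; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a square linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_homography(src: Sequence[Point], dst: Sequence[Point]) -> List[float]:
    """Return the 8 coefficients h11..h32 (h33 fixed to 1) mapping src points to dst points."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    return solve_linear(rows, rhs)


def apply_homography(h: Sequence[float], p: Point) -> Point:
    """Map one reference-document coordinate into the digital document."""
    x, y = p
    w = h[6] * x + h[7] * y + 1.0
    return ((h[0] * x + h[1] * y + h[2]) / w, (h[3] * x + h[4] * y + h[5]) / w)


# Reference feature points and where they were detected (illustrative values).
ref_pts = [(0, 0), (100, 0), (100, 100), (0, 100)]
doc_pts = [(0, 0), (100, 10), (100, 90), (0, 100)]
h = fit_homography(ref_pts, doc_pts)
mapped_corner = apply_homography(h, (100, 0))
```

Applying `apply_homography` to each of the four corners (a5, b5, c5, d5) would give the corners of the extraction area in the digital document. With more than four correspondences a least-squares fit would be the usual alternative.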

As another example, after the position and size of the reference extraction region relative to the reference feature region are determined from the coordinates of the reference feature region and the coordinates of the reference extraction region, an extraction region having the same relative position and size can be generated from the coordinates of the feature region detected in the digital document.
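The relative-position approach can be sketched as follows. The code assumes axis-aligned boxes in `(left, top, right, bottom)` form and a single feature region; it expresses the offset and size of the extraction box in units of the feature box, so rotation and distortion are not handled. The function name is hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def relative_extraction_box(ref_feature: Box, ref_extraction: Box, detected_feature: Box) -> Box:
    """Place the extraction box relative to the detected feature box.

    Offsets measured from the reference feature box are rescaled by how much
    the detected feature box grew or shrank in each direction.
    """
    fl, ft, fr, fb = ref_feature
    el, et, er, eb = ref_extraction
    dl, dt, dr, db = detected_feature
    scale_x = (dr - dl) / (fr - fl)
    scale_y = (db - dt) / (fb - ft)
    return (
        dl + (el - fl) * scale_x,
        dt + (et - ft) * scale_y,
        dl + (er - fl) * scale_x,
        dt + (eb - ft) * scale_y,
    )


# The detected feature box is the reference box shrunk to 90 percent about the origin.
box = relative_extraction_box((10, 10, 110, 40), (200, 300, 500, 340), (9, 9, 99, 36))
```

In this example the result is approximately `(180, 270, 450, 306)`, that is, the reference extraction box shrunk by the same 90 percent.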

An example of generating the extraction area 520 when the digital document 500 is reduced compared to the reference document 200 is shown in FIG. 5. Referring to FIG. 5, the document recognition apparatus compares the coordinates of the reference feature regions 210, 212, 214, 216 of the reference document 200 with the coordinates of the feature regions 510, 512, 514, 516 of the digital document 500 to determine how much the digital document 500 has been reduced compared to the reference document 200.

For example, the document recognition apparatus may determine the horizontal reduction ratio from the ratio between the distance from the first reference characteristic information 210 to the second reference characteristic information 212 of the reference document 200 and the distance from the first characteristic information 510 to the second characteristic information 512 of the digital document 500. Likewise, it may determine the vertical reduction ratio from the ratio between the distance from the first reference characteristic information 210 to the third reference characteristic information 214 of the reference document 200 and the distance from the first characteristic information 510 to the third characteristic information 514 of the digital document 500. The document recognition apparatus may generate the extraction area 520 in which the coordinates of the reference extraction area 220 of the reference document 200 are converted to fit the digital document 500 based on the horizontal and vertical reduction ratios.
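The distance-ratio computation described above can be sketched as follows. Each feature is reduced to a single representative point, which is an assumption made here for brevity; the parameter names reuse the reference numerals of FIGS. 2 and 5, and the coordinates are illustrative.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def distance(p: Point, q: Point) -> float:
    return math.hypot(q[0] - p[0], q[1] - p[1])


def scale_ratios(ref_210: Point, ref_212: Point, ref_214: Point,
                 doc_510: Point, doc_512: Point, doc_514: Point) -> Tuple[float, float]:
    """Return (horizontal ratio, vertical ratio) of the digital document.

    A ratio below 1 means the digital document is reduced in that direction,
    and a ratio above 1 means it is enlarged.
    """
    horizontal = distance(doc_510, doc_512) / distance(ref_210, ref_212)
    vertical = distance(doc_510, doc_514) / distance(ref_210, ref_214)
    return horizontal, vertical


# Reference features at three corners; detected features after a 10 percent reduction.
ratios = scale_ratios((0, 0), (1000, 0), (0, 1400), (50, 70), (950, 70), (50, 1330))
```

Here both ratios come out at about 0.9, corresponding to a 10 percent reduction horizontally and vertically.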

For example, suppose that the digital document 500 is reduced by 10% in each of the horizontal and vertical directions compared to the reference document 200, that the centers of the reference document 200 and the digital document 500 coincide, and that this center is the origin of the coordinate system. Then a coordinate value (x, y) of the reference extraction region 220 located in the first quadrant of the coordinate system is converted into (x - |x * 0.1|, y - |y * 0.1|). If the center of the reference document 200 and the center of the digital document 500 do not coincide, the difference between the two centers may be additionally reflected in the coordinate value of the reference extraction area 220.
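The conversion in this example amounts to multiplying each centered coordinate by the remaining ratio (0.9 for a 10 percent reduction) and then adding any offset between the two centers. The sketch below assumes coordinates are already expressed relative to the reference document's center; the function name and the `center_shift` parameter are illustrative.

```python
from typing import Tuple

Point = Tuple[float, float]


def scale_about_center(point: Point, ratio_x: float, ratio_y: float,
                       center_shift: Point = (0.0, 0.0)) -> Point:
    """Convert a centered reference coordinate into the digital document.

    `ratio_x` and `ratio_y` are the horizontal and vertical scale ratios;
    `center_shift` is the digital document's center minus the reference
    document's center, and is (0, 0) when the two centers coincide.
    """
    x, y = point
    return (x * ratio_x + center_shift[0], y * ratio_y + center_shift[1])


# A first-quadrant corner at (200, 100) under a 10 percent reduction.
same_center = scale_about_center((200, 100), 0.9, 0.9)
shifted_center = scale_about_center((200, 100), 0.9, 0.9, center_shift=(5, -3))
```

For the point (200, 100) this gives approximately (180, 90), which matches x - |x * 0.1| and y - |y * 0.1|; unlike the absolute-value form, the multiplication also behaves correctly in the other three quadrants.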

An example of generating the extraction area 620 when distortion occurs in the digital document 600 is illustrated in FIG. 6. Referring to FIG. 6, the document recognition apparatus compares the coordinates of the reference feature regions 210, 212, 214, 216 of the reference document 200 with the coordinates of the feature regions 610, 612, 614, 616 of the digital document 600 to determine what kind of distortion the digital document 600 has undergone compared to the reference document 200.

For example, the document recognition apparatus compares the slope between the first reference characteristic information 210 and the second reference characteristic information 212 of the reference document 200 and the slope between the third reference characteristic information 214 and the fourth reference characteristic information 216 with the slope between the first characteristic information 610 and the second characteristic information 612 of the digital document 600 and the slope between the third characteristic information 614 and the fourth characteristic information 616. From this comparison it can determine that the document has been deformed so that it becomes narrower toward the right. The document recognition apparatus may generate the extraction area 620 for extracting information of the digital document 600 by reflecting the deformation in the coordinates of the reference extraction area 220.
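The slope comparison can be sketched as follows. The code assumes image coordinates with y growing downward, one representative point per feature, and the layout of FIG. 2 in which the first and second features lie on the upper edge and the third and fourth on the lower edge. The tolerance and the returned labels are illustrative choices.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def slope(p: Point, q: Point) -> float:
    """Slope of the line from p to q; the points must differ in x."""
    return (q[1] - p[1]) / (q[0] - p[0])


def edge_convergence(points: Sequence[Point]) -> float:
    """Upper-edge slope minus lower-edge slope.

    `points` is (first, second, third, fourth) feature point. With y growing
    downward, a positive value means the two edges approach each other
    toward the right.
    """
    first, second, third, fourth = points
    return slope(first, second) - slope(third, fourth)


def classify_taper(ref_points: Sequence[Point], doc_points: Sequence[Point],
                   tol: float = 1e-3) -> str:
    change = edge_convergence(doc_points) - edge_convergence(ref_points)
    if change > tol:
        return "narrows to the right"
    if change < -tol:
        return "narrows to the left"
    return "no horizontal taper"


ref = [(0, 0), (100, 0), (0, 100), (100, 100)]
doc = [(0, 0), (100, 10), (0, 100), (100, 90)]
result = classify_taper(ref, doc)
```

This only classifies the distortion. To actually move the extraction area, the identified distortion still has to be turned into a coordinate mapping, for example a projective transform.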

FIG. 7 illustrates a case in which another form of distortion occurs. Referring to FIG. 7, the document recognition apparatus compares the coordinates of the first to fourth reference characteristic information 210, 212, 214, 216 of the reference document 200 with the coordinates of the first to fourth characteristic information 710, 712, 714, 716 of the digital document 700. From this comparison it can determine that the digital document 700 is inclined relative to the reference document 200. The document recognition apparatus may generate the extraction area 720 by transforming the coordinate values of the reference extraction area 220 according to the corresponding deformation.
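For an inclined document, one simple way to estimate the inclination is to compare the direction of the line joining two features before and after scanning, then rotate the reference extraction coordinates by that angle. This is a sketch under the assumption of a pure rotation plus translation with no scaling; the function names are hypothetical.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def rotation_angle(ref_a: Point, ref_b: Point, doc_a: Point, doc_b: Point) -> float:
    """Angle in radians by which the line a-b turned between the two documents."""
    ref_angle = math.atan2(ref_b[1] - ref_a[1], ref_b[0] - ref_a[0])
    doc_angle = math.atan2(doc_b[1] - doc_a[1], doc_b[0] - doc_a[0])
    return doc_angle - ref_angle


def map_point(point: Point, ref_a: Point, doc_a: Point, angle: float) -> Point:
    """Rotate `point` about the reference feature `ref_a`, then move it to `doc_a`."""
    dx, dy = point[0] - ref_a[0], point[1] - ref_a[1]
    c, s = math.cos(angle), math.sin(angle)
    return (doc_a[0] + c * dx - s * dy, doc_a[1] + s * dx + c * dy)


# Two reference features on a horizontal line, detected on a vertical line.
angle = rotation_angle((0, 0), (100, 0), (10, 20), (10, 120))
corner = map_point((50, 0), ref_a=(0, 0), doc_a=(10, 20), angle=angle)
```

In this example the estimated angle is a quarter turn and the reference point (50, 0) lands at approximately (10, 70). Mapping all four corners of the reference extraction area this way yields an inclined extraction area.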

FIGS. 5 to 7 show some examples to help the understanding of the present invention, and the deformation of the digital document occurring in the actual digitization process may take various forms. This embodiment compares the coordinates of the feature region detected in the digital document with the coordinates of the reference feature region of the reference document to determine where the coordinates of the reference extraction region of the reference document are mapped in the digital document, so that an extraction area suited to the digital document can be generated.

In another embodiment, when only one reference feature region (e.g., the first reference feature region 210) is used, the coordinates (a1, b1, c1, d1) of the four corners of the one reference feature region 210 are compared with the coordinates of the four corners of the corresponding feature region of the digital document. The position, inclination, size, or shape of the digital document relative to the reference document is thereby identified and reflected in the reference extraction area to determine the extraction area of the digital document.

FIG. 8 is a diagram showing a reference extraction area, defined in a reference document according to an embodiment of the present invention, simply superimposed on a digital document deformed in the scanning process, and FIG. 9 is a diagram showing the reference extraction area superimposed on the digital document after being modified to fit it.

Referring to FIG. 8, since the reference extraction areas 320, 322, 324, 326, 328 and 330 fall outside the information display area of the digital document 800, the necessary information cannot be accurately extracted. On the other hand, referring to FIG. 9, the necessary information can be accurately extracted using the extraction regions 910, 914, 916, 918 and 920 modified to fit the digital document 800.

FIG. 10 is a flowchart illustrating an example of a document recognition method according to an embodiment of the present invention.

Referring to FIG. 10, the document recognition apparatus defines at least one reference feature area and at least one reference extraction area within the reference document (S1000). The reference feature area and the reference extraction area may be directly selected by the user or automatically defined. An example of a method for defining a reference feature region is shown in FIG. 4.

The document recognition apparatus detects a characteristic area of the digital document (S1010). The document recognition apparatus compares the reference feature area of the reference document with the feature area of the digital document to generate an extraction area in which the position, inclination, size or shape of the reference extraction area is changed (S1020).

The document recognition device recognizes and stores characters or numbers located in the extraction area. For example, the document recognition apparatus may automatically recognize and store characters or numbers located in the extraction area using various conventional character recognition programs such as OCR (Optical Character Recognition) technology.

Referring to FIG. 11, the document recognition apparatus 1100 includes a reference setting unit 1110, a feature detection unit 1120, an extraction region determining unit 1130, and an information recognition unit 1140. The document recognition apparatus may be implemented as a computer including a memory, a processor, an input/output device, and the like, and each component, such as the reference setting unit 1110, may be implemented as software, loaded into the memory, and then executed by the processor.

The reference setting unit 1110 sets at least one reference feature area and at least one reference extraction area in the reference document. As an embodiment, the reference setting unit 1110 may provide a user interface through which the user can set the reference feature area and the reference extraction area. In another embodiment, the reference setting unit 1110 may use various conventional algorithms, such as deep learning or image comparison programs, to extract characteristic information consisting of at least one of letters, numbers, logos, pictures, patterns, or combinations thereof existing in the basic form of the reference document, and automatically set a plurality of characteristic areas composed of that characteristic information.

The feature detection unit 1120 detects a feature region in the digital document. For example, when the feature information defining the reference feature area is a character string or a number string, the feature detection unit 1120 searches for an area in the digital document in which the corresponding character string or number string exists. In doing so, the feature detection unit 1120 may treat the character string or number string as a single image and find the part of the digital document in which the same image exists, in which case deep learning technology or various conventional algorithms for image comparison and analysis can be applied. Alternatively, the feature detection unit 1120 may use a character recognition function to locate the corresponding character string or number string.
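Treating a character string as an image and looking for the same image in the digital document can be illustrated with a brute-force template search. The sketch below uses the sum of absolute pixel differences as the matching cost, which is one simple conventional choice and not necessarily the algorithm intended by the disclosure. Images are plain nested lists of grayscale values, and the function name is hypothetical.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]


def find_template(image: Image, template: Image) -> Optional[Tuple[int, int, int]]:
    """Return (left, top, cost) of the best match, or None if the template does not fit.

    Every placement of the template inside the image is scored by the sum of
    absolute differences; the lowest cost wins.
    """
    image_h, image_w = len(image), len(image[0])
    tmpl_h, tmpl_w = len(template), len(template[0])
    best: Optional[Tuple[int, int, int]] = None
    for top in range(image_h - tmpl_h + 1):
        for left in range(image_w - tmpl_w + 1):
            cost = sum(
                abs(image[top + r][left + c] - template[r][c])
                for r in range(tmpl_h)
                for c in range(tmpl_w)
            )
            if best is None or cost < best[2]:
                best = (left, top, cost)
    return best


# A 5 x 6 blank page with a 2 x 2 pattern placed at column 3, row 2.
page = [[0] * 6 for _ in range(5)]
page[2][3], page[2][4] = 9, 8
page[3][3], page[3][4] = 7, 6
match = find_template(page, [[9, 8], [7, 6]])
```

Restricting `image` to a search window around the reference coordinates keeps this exhaustive search affordable. A scaled or rotated feature would need a more tolerant matcher than this exact-size comparison.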

The extraction region determining unit 1130 generates an extraction region in which the position, inclination, size or shape of the reference extraction region is changed based on the coordinate difference between the reference feature region of the reference document and the feature region detected in the digital document. Various examples of generating an extraction area of a digital document are shown in FIGS. 5 to 7 .

The information recognition unit 1140 recognizes and stores letters or numbers existing in the extraction area in the digital document. For example, if the character string 'abc' exists in the extraction area, the information recognition unit recognizes the character string 'abc' using various conventional character recognition algorithms, converts it into computer-readable data, and stores it.
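The cooperation of the four units can be sketched as a small class in which each unit is a pluggable function. Everything here is illustrative: the class name `DocumentRecognizer`, the callable signatures, and the stand-in detector, mapper, and recognizer are assumptions, and a real system would substitute actual feature detection and character recognition.

```python
from typing import Callable, Dict, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


class DocumentRecognizer:
    def __init__(self, detect: Callable, map_region: Callable, recognize: Callable):
        self.detect = detect          # role of feature detection unit 1120
        self.map_region = map_region  # role of extraction region determining unit 1130
        self.recognize = recognize    # role of information recognition unit 1140
        self.features: Dict[str, Box] = {}
        self.extractions: Dict[str, Box] = {}

    def set_reference(self, features: Dict[str, Box], extractions: Dict[str, Box]) -> None:
        """Role of reference setting unit 1110: store the reference coordinates."""
        self.features, self.extractions = dict(features), dict(extractions)

    def process(self, document) -> Dict[str, str]:
        detected = self.detect(document, self.features)
        results = {}
        for name, ref_box in self.extractions.items():
            region = self.map_region(self.features, detected, ref_box)
            results[name] = self.recognize(document, region)
        return results


# Stand-ins for demonstration: the "document" is a dict describing a pure shift.
def fake_detect(document, ref_features):
    dx, dy = document["shift"]
    return {k: (l + dx, t + dy, r + dx, b + dy) for k, (l, t, r, b) in ref_features.items()}


def shift_region(ref_features, detected, box):
    key = next(iter(ref_features))
    dx = detected[key][0] - ref_features[key][0]
    dy = detected[key][1] - ref_features[key][1]
    left, top, right, bottom = box
    return (left + dx, top + dy, right + dx, bottom + dy)


def fake_ocr(document, region):
    return document["text_at"].get(region, "")


recognizer = DocumentRecognizer(fake_detect, shift_region, fake_ocr)
recognizer.set_reference({"210": (10, 10, 110, 40)}, {"220": (200, 300, 500, 340)})
scanned = {"shift": (5, 7), "text_at": {(205, 307, 505, 347): "abc"}}
output = recognizer.process(scanned)
```

Keeping the units as separate callables mirrors the description, in which each unit may be implemented with any of various conventional algorithms.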

The present invention can also be embodied as computer readable program code on a computer readable recording medium. The computer-readable recording medium includes all kinds of recording devices in which data readable by a computer system is stored. Examples of the computer-readable recording medium include ROM, RAM, CD-ROM, magnetic tape, floppy disk, and optical data storage device. In addition, the computer-readable recording medium is distributed in a network-connected computer system so that the computer-readable code can be stored and executed in a distributed manner.

So far, the present invention has been looked at with respect to preferred embodiments thereof. Those of ordinary skill in the art to which the present invention pertains will understand that the present invention can be implemented in a modified form without departing from the essential characteristics of the present invention. Therefore, the disclosed embodiments should be considered in an illustrative rather than a restrictive sense. The scope of the present invention is indicated in the claims rather than the foregoing description, and all differences within the scope equivalent thereto should be construed as being included in the present invention.

Claims

1. A document recognition method comprising:

defining at least one reference feature area and at least one reference extraction area in a reference document;

detecting at least one feature area in a digital document;

generating an extraction region in which the position, slope, size or shape of the reference extraction region is changed based on a difference in coordinates between the reference feature region and the feature region detected in the digital document; and

recognizing and storing information located in the extraction area in the digital document.

2. The method of claim 1, wherein

the reference feature area is an area including a feature composed of at least one of letters, numbers, logos, pictures, and patterns existing in the basic form of the reference document, and

the detecting comprises detecting a plurality of feature regions by comparing the feature with the digital document.

3. The method of claim 1, wherein the defining step comprises:

setting a plurality of reference feature areas and identifying and storing the coordinates of each reference feature area, or identifying and storing a plurality of coordinates for one reference feature area.

4. The method of claim 1, wherein the generating of the extraction region comprises:

comparing the coordinates of the reference feature region with the coordinates of the feature region detected in the digital document to determine movement, rotation, reduction, enlargement, or distortion of the digital document compared to the reference document;

calculating transformation coordinates for the coordinates of the reference extraction area by reflecting the identified movement, rotation, reduction, enlargement, or distortion; and

generating an extraction region composed of the transformation coordinates.

5. A document recognition apparatus comprising:

a reference setting unit defining at least one reference feature area and at least one reference extraction area in a reference document;

a feature detection unit for detecting at least one feature region in a digital document;

an extraction region determining unit generating an extraction region in which a position, inclination, size or shape of the reference extraction region is changed based on a difference in coordinates between the reference feature region and the feature region detected in the digital document; and

an information recognition unit for recognizing and storing information located in the extraction area in the digital document.

6. The apparatus of claim 5, wherein the extraction region determining unit

compares the coordinates of the reference feature region with the coordinates of the feature region detected in the digital document to identify the movement, rotation, reduction, enlargement, or distortion of the digital document compared to the reference document, calculates transformation coordinates for the coordinates of the reference extraction area by reflecting the identified movement, rotation, reduction, enlargement, or distortion, and generates an extraction area composed of the transformation coordinates.

7. A computer-readable recording medium in which a program for performing the method according to claim 1 is recorded.