CA2427468A1 - Method of producing a database having input from a scanned document - Google Patents

Method of producing a database having input from a scanned document

Info

Publication number
CA2427468A1
Authority
CA
Canada
Prior art keywords
database
database field
character recognition
optical character
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
CA002427468A
Other languages
French (fr)
Inventor
Girts Jansons
Rob Tigwell
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CA002427468A
Publication of CA2427468A1
Legal status: Abandoned

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/1444Selective acquisition, locating or processing of specific regions, e.g. highlighted text, fiducial marks or predetermined fields
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition

Abstract

A method of producing a database having input from a scanned document comprises the steps of performing a preliminary scan of the document to thereby produce a digital image stored in computer memory; displaying the scanned image on the computer screen; presenting on the computer screen a plurality of database field types associated with the database, for subsequent selection of one of the database field types by a user; retrieving the properties of the selected database field type; optimizing optical character recognition software according to the properties of the selected database field; accepting a user-defined area of the displayed image on the screen; performing optical character recognition on the defined area so as to convert images within the defined area into resultant text; displaying the resultant text; and storing the resultant text in the database. Steps (a) through (i) are performed as necessary, to form the database.

Description

FIELD OF THE INVENTION
[0001] The present invention relates to a method of producing a database having input from a scanned document, such as a legal document, wherein optical character recognition is used to populate the database, and more particularly to such a method wherein the optical character recognition software is optimized according to the properties of selected fields in the database.
BACKGROUND OF THE INVENTION
[0002] Many businesses, organizations, and the like need to store large numbers of paper documents in an organized manner such that the documents can be identified and retrieved, and information in these documents can be readily found. For instance, in the legal profession, hundreds or even thousands of court documents and other related documents need to be stored for ready use during work on a trial. It is necessary that information within such documents be easily located when needed. Without the use of a computer, it would be necessary to physically read through the documents one page at a time in order to locate the information. Such reading of these documents is usually completely impractical, as it is extremely time consuming and much of the information needed could be easily overlooked.
[0003] Alternatively, in order to permit computers to be used for searching for such information, documents can be scanned into a computer database using optical character recognition software.
This method is known to be only reasonably accurate, at best, in terms of actual character recognition, per se, and may need to be supplemented by corrections typically made by using editing software, word processing software, or the like, especially when dealing with legal documents. However, a text document that is created using optical character recognition software is essentially only a collection of words and numbers presented in a visual format of some type. There is no defined significance to the various characters, words, and numbers. For instance, a name such as John Doe may appear in the document several times, but that name might have no significance in the overall context of the document.
Accordingly, searching for information in a complete document created in its entirety by optical character recognition generally produces results that have very little meaning. In contrast, another name, such as Herman Schwartz, might be very significant, since he might be the subject of the document, the author, and so on; however, using this method, no significance is associated with the name.
[0004] It has been found that it is useful to categorize and store in a database various significant key words related to a document, such as subject, author, date, location, document type, and so on. As is well known in the prior art, significant information from the document is manually coded into the database.
In order to identify particular documents that are being sought, the database is subsequently viewed to find key words or is searched by known key words. Due to the time it takes to manually code such information into a database, this method is extremely slow and undesirable.
[0005] There is also a consideration of processing time when using optical character recognition software. A considerable amount of time can be used when scanning a large number of documents, many of which may have several dozen pages or more.
Using large amounts of time for creating information databases relating to documents is highly undesirable, as it is ultimately expensive.
[0006] One specific attempt to produce a quick and accurate method of recognizing text from scanned documents is disclosed in U.S. Patent 6,400,845, issued June 4th, 2002, to Volino, and entitled System and Method of Data Extraction from Digital Images. In this system and method of extraction of textual data, the digital image to be processed is first compared against master document images contained in a database. Upon determining the proper master document image, a template having predefined data zones is applied to the image to create zone images. The zone images are optically read and converted into a character file which is then parsed with the pattern to locate the text to be extracted. Upon finding data matching the pattern, that data is extracted and visible portions are used to populate data fields in the database record associated with the digital image. In other words, the optical character recognition is dependent on the template having predefined data zones. Such templates might include an invoice template, a purchase order template, a memo template, a name and address template, and so on. Configuring an optical character recognition engine according to a known or suspected template is of limited usefulness.
[0007] It is an object of the present invention to provide a method of producing a database having input from a scanned document.
[0008] It is another object of the present invention to provide a method of producing a database having input from a scanned document, wherein the database is populated by scanning of documents.
[0009] It is another object of the present invention to provide a method of producing a database having input from a scanned document, wherein the database is populated by means of optical character recognition of selected portions of documents.
[00010] It is another object of the present invention to provide a method of producing a database having input from a scanned document, wherein the optical character recognition engine is optimized according to a database field type relating to a selected portion of a document.
[00011] It is another object of the present invention to provide a method of producing a database having input from a scanned document, wherein the optical character recognition engine is optimized according to the records of the database.
[00012] It is another object of the present invention to provide a method of producing a database having input from a scanned document, wherein the optical character recognition engine is optimized according to the records of the database, which records represent the record history of the database.
[00013] It is another object of the present invention to provide a method of producing a database having input from a scanned document, wherein the optical character recognition engine is optimized according to the edit history of the database.
SUMMARY OF THE INVENTION
[00014] In accordance with one aspect of the present invention there is disclosed a novel method of producing a database having input from a scanned document. The method comprises the steps of:
(a) performing a preliminary scan of the document to thereby produce a digital image stored in computer memory;
(b) displaying the scanned image on the computer screen;
(c) presenting on the computer screen a plurality of database field types associated with the database, for subsequent selection of one of the database field types by a user;
(d) retrieving the properties of the selected database field type;
(e) optimizing optical character recognition software according to the properties of the selected database field;
(f) accepting a user-defined area of the displayed image on the screen;
(g) performing optical character recognition on the defined area so as to convert images within the defined area into resultant text;
(h) displaying the resultant text; and
(i) storing the resultant text in the database.
Steps (a) through (i) are performed at least one time each, as necessary, to form the database.
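Steps (d) through (i) can be read as a loop over the user's selections. The following Python sketch is offered only as an illustration under stated assumptions: the names FieldType, build_database and the recognize callback are hypothetical, and the optical character recognition engine is replaced by a stub, since the description does not name a data model or an engine.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical type: (left, top, right, bottom) in image pixels.
Area = Tuple[int, int, int, int]


@dataclass
class FieldType:
    """Hypothetical container for a database field type and its properties."""
    name: str
    properties: Dict[str, object] = field(default_factory=dict)


def build_database(
    image: object,
    selections: List[Tuple[FieldType, Area]],
    recognize: Callable[[object, Area, Dict[str, object]], str],
) -> List[Dict[str, str]]:
    """Run steps (d) through (i) once for each (field type, area) selection."""
    records = []
    for field_type, area in selections:
        props = field_type.properties         # (d) retrieve the properties
        text = recognize(image, area, props)  # (e)-(g) tuned OCR on the area
        # (h) an interactive program would display `text` for review here
        records.append({"field": field_type.name, "text": text})  # (i) store
    return records


def fake_ocr(image: object, area: Area, props: Dict[str, object]) -> str:
    """Stub engine so the sketch runs; a real engine would read the pixels."""
    return "2003-05-02" if props.get("kind") == "date" else "AFFIDAVIT"


rows = build_database(
    image=None,
    selections=[
        (FieldType("date", {"kind": "date"}), (10, 10, 120, 30)),
        (FieldType("title", {"kind": "text"}), (10, 40, 300, 70)),
    ],
    recognize=fake_ocr,
)
print(rows)
```

Passing the engine in as a callback keeps the loop independent of any particular recognition library, which matches the way the description treats the engine as a separately optimized component.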
[00015] Other advantages, features and characteristics of the present invention, as well as methods of operation and functions of the related elements of the structure, and the combination of parts and economies of manufacture, will become more apparent upon consideration of the following detailed description and the appended claims with reference to the accompanying drawings, the latter of which is briefly described herein below.
BRIEF DESCRIPTION OF THE DRAWINGS
[00016] The novel features which are believed to be characteristic of the method of producing a database having input from a scanned document according to the present invention, as to its organization, use and method of operation, together with further objectives and advantages thereof, will be better understood from the following drawings in which a presently preferred embodiment of the invention will now be illustrated by way of example. It is expressly understood, however, that the drawings are for the purpose of illustration and description only, and are not intended as a definition of the limits of the invention. In the accompanying drawings:
[00017] Figure 1 is a flow chart of the preferred embodiment of the method of producing a database having input from a scanned document according to the present invention;
[00018] Figure 2 is a representation of a computer screen showing a scanned document;
[00019] Figure 3 is a representation of a computer screen showing the software employing the method of the present invention, with the type of database field about to be selected;
[00020] Figure 4 is a representation of a computer screen showing the software employing the method of the present invention, with a date type of database field having been selected;
[00021] Figure 5 is a representation of a computer screen as shown in Figure 4, and additionally showing the user defining an area of scanned characters for subsequent optical character recognition;
[00022] Figure 6 is a representation of a computer screen as shown in Figure 4, and additionally showing the user defined area having scanned characters therein and the resultant text from the optical character recognition execution;
[00023] Figure 7 is a representation of a computer screen showing the software employing the method of the present invention, with a text type of database field having been selected;
[00024] Figure 8 is a representation of a computer screen as shown in Figure 7, and additionally showing the user defining an area of scanned characters for subsequent optical character recognition;
[00025] Figure 9 is a representation of a computer screen as shown in Figure 7, and additionally showing the user defined area having scanned characters therein and the resultant text from the optical character recognition execution;
[00026] Figure 10 is a representation of a computer screen as shown in Figure 9, with the resultant text from the optical character recognition execution being edited;
[00027] Figure 11 is a representation of a computer screen showing the software employing the method of the present invention, with a date type of database field having been selected;
[00028] Figure 12 is a representation of a computer screen as shown in Figure 11, and additionally showing the user defining an area of scanned characters for subsequent optical character recognition;
[00029] Figure 13 is a representation of a computer screen as shown in Figure 11, and additionally showing the user defined area having scanned characters therein and the resultant text from the optical character recognition execution;
[00030] Figure 14 is a simplified diagrammatic representation of an edit history of the database field; and,
[00031] Figure 15 is a representation of a computer screen showing the software employing an alternative embodiment of the method of the present invention, showing the user defined area having scanned characters therein and the resultant text from the optical character recognition execution.

DETAILED DESCRIPTION OF THE PREFERRED AND ALTERNATIVE EMBODIMENTS
[00032] Reference will now be made to Figures 1 through 14, which show a preferred embodiment of the method of producing a database 22 having input from a scanned document 24, according to the present invention, as indicated by general reference numeral 20.
The preferred embodiment method 20 of producing a database 22 having input from a scanned document 24 comprises the steps of first performing a preliminary scan of the document to thereby produce a digital image 30, as can be best seen in Figure 2. The digital image 30 of the scanned document 24 is stored in computer memory. The method 20 of producing a database 22 having input from a scanned document 24 is embodied in the form of a software program executed on an appropriate computer, typically a microcomputer.
[00033] The scanned image 32 of the document is then displayed on the computer screen 34, as is best seen in Figure 3. The document has a document number associated with it, as shown in box 36 on the computer screen 34. The desired document number may be entered in the box 36 entitled "Go to Document", or alternatively the arrow buttons could be used to navigate through the document images.
Also, a list of the document numbers 37 is presented in a menu for selection by a user.
[00034] As can be seen in Figures 4 through 13, a plurality of database field types associated with the database 22 are presented on the computer screen 34. These database field types include a date database field 38, text database fields such as document type 41, title 42, summary 43, author 44, recipient 45, and location 46, and numeric database fields, which might include a page number or a monetary value 48.
[00035] The database field types are presented on the computer screen 34 for subsequent selection of one of the database field types by a user. Typically, a user would move a cursor to the desired database field on the screen, and select it by clicking a mouse button, or the like. The computer program then retrieves the properties of the selected database field type. For instance, the properties of a date database field type might include limiting characters to 1 through 12, 1 through 31, and a two digit number preceded by a "19" or a "20". The properties of a text database field type might include a maximum length, a list of preferred field content, and a history of past entries into that field. The properties of a numeric database field type might include maximum value, minimum value, and maximum length of content.
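The properties listed above for each field type could be modelled as small records with a validity check. This is a minimal sketch, assuming hypothetical class and attribute names (DateFieldProps, TextFieldProps, NumericFieldProps, accepts); the description lists the properties but does not say how they are stored.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class DateFieldProps:
    # Ranges from the description: 1 through 12, 1 through 31, and a
    # two-digit number preceded by "19" or "20".
    month_range: Tuple[int, int] = (1, 12)
    day_range: Tuple[int, int] = (1, 31)
    century_prefixes: Tuple[str, ...] = ("19", "20")

    def accepts(self, month: int, day: int, year: str) -> bool:
        return (
            self.month_range[0] <= month <= self.month_range[1]
            and self.day_range[0] <= day <= self.day_range[1]
            and len(year) == 4
            and year[:2] in self.century_prefixes
            and year[2:].isdigit()
        )


@dataclass
class TextFieldProps:
    max_length: int
    preferred_content: List[str] = field(default_factory=list)
    history: List[str] = field(default_factory=list)  # past entries

    def accepts(self, text: str) -> bool:
        return len(text) <= self.max_length


@dataclass
class NumericFieldProps:
    min_value: float
    max_value: float
    max_length: int  # maximum number of characters in the content

    def accepts(self, text: str) -> bool:
        try:
            value = float(text)
        except ValueError:
            return False  # not a number at all
        return (
            len(text) <= self.max_length
            and self.min_value <= value <= self.max_value
        )


print(DateFieldProps().accepts(5, 2, "2003"))
print(NumericFieldProps(0, 1000, 7).accepts("250.00"))
```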
[00036] The optical character recognition software is then optimized according to the properties of the selected database field.
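The description does not say how the software is optimized. One plausible reading, which is an assumption and not something the source confirms, is that the field type restricts the characters the engine may output. The sketch below applies that idea as a post-processing step; the function names and the look-alike table are hypothetical.

```python
import string

# Hypothetical look-alike table: letters often confused with digits.
LOOKALIKE_DIGITS = {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}


def allowed_characters(kind: str) -> set:
    """Return the characters permitted for a field kind (assumed sets)."""
    if kind == "date":
        return set(string.digits + "/-. ")
    if kind == "numeric":
        return set(string.digits + ".,$-")
    return set(string.printable)  # text fields accept any printable character


def constrain(raw: str, kind: str) -> str:
    """Force raw OCR output into the character set of the selected field."""
    allowed = allowed_characters(kind)
    out = []
    for ch in raw:
        if ch in allowed:
            out.append(ch)
        elif kind in ("date", "numeric") and ch in LOOKALIKE_DIGITS:
            out.append(LOOKALIKE_DIGITS[ch])  # swap in the look-alike digit
        # any other character is dropped as noise for this field kind
    return "".join(out)


print(constrain("2OO3-O5-O2", "date"))
print(constrain("$1,25O.OO", "numeric"))
```

A production engine would more likely apply such a restriction during recognition, not afterwards; the post-processing form is used here only because it needs no engine.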
[00037] Reference will now be made to Figures 4 through 6, which show the selection of a date database field type. Figure 4 shows the date database field 38 being selected. Figure 5 shows a user-defined area 50 of the displayed image being created on the computer screen 34. The program then accepts this user-defined area 50 of the displayed image on the screen for use by the optical character recognition software. The next step is to perform optical character recognition on the user-defined area 50 so as to convert images 52 within the defined area into resultant text 54.
As can be seen in Figure 6, an enlarged version 56 of the captured image is displayed on the screen, for ready verification by a user that it is the desired image. The resultant text 54 is then displayed on the computer screen 34, in box 38. In this manner, the resultant text is displayed on the computer screen 34 in correlation with the presentation of the selected database field type. Further, the resultant text 54 is presented on the computer screen 34 for editing, if necessary, which will be discussed in greater detail subsequently.
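Accepting a user-defined area amounts to turning a mouse drag into a rectangle and extracting the pixels inside it. The sketch below assumes a hypothetical representation of the image as rows of grayscale values and hypothetical helper names (normalise, crop); it is not tied to any imaging library.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of grayscale values; a stand-in for a bitmap
Area = Tuple[int, int, int, int]  # (left, top, right, bottom)


def normalise(x0: int, y0: int, x1: int, y1: int) -> Area:
    """Order the corners so a drag in any direction gives a valid area."""
    return min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1)


def crop(image: Image, area: Area) -> Image:
    """Return the pixels inside the area; right and bottom are exclusive."""
    left, top, right, bottom = area
    height = len(image)
    width = len(image[0]) if image else 0
    # Clamp to the image bounds so a drag past the edge does not fail.
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)
    return [row[left:right] for row in image[top:bottom]]


# A 6 x 4 test image whose pixel value encodes its row and column.
page = [[10 * r + c for c in range(6)] for r in range(4)]

# The user drags from the lower right to the upper left.
region = crop(page, normalise(4, 3, 1, 1))
print(region)
```

The cropped region is what would be passed to the recognition engine and shown enlarged for verification.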
[00038] The steps discussed above, from retrieving the properties of the selected database field type through storing the resultant text as records 28 in the database 22, are performed on an iterative basis to form the database 22.
[00039] The resultant text is then stored as records 28 in the database 22, as can be best seen in Figure 1. The records 28 stored in the database 22 represent the record history of the database field. In subsequent occurrences of the same database field type, such as the date database field type shown in Figures 4 through 6, the optical character recognition software is optimized according to the appropriate records 28 stored in the database 22. This type of optimization is also performed on an iterative basis with the above steps to form the database 22.
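One way the record history could inform recognition is as a prior over competing readings. The sketch below assumes, hypothetically, that the engine returns several candidate readings in its own order of preference; the function name pick_by_history is likewise an assumption, and the source does not describe this mechanism.

```python
from collections import Counter
from typing import Iterable, List


def pick_by_history(candidates: List[str], history: Iterable[str]) -> str:
    """Prefer the candidate seen most often in the field's past records."""
    seen = Counter(history)
    # max() returns the first maximal item, so the engine's own ranking
    # decides when the history has nothing to say about any candidate.
    return max(candidates, key=lambda text: seen[text])


location_history = ["Toronto", "Toronto", "Ottawa"]

print(pick_by_history(["Tor0nto", "Toronto"], location_history))
print(pick_by_history(["Hamilton", "Harnilton"], location_history))
```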
[00040] Reference will now be made to Figures 7 through 10, which show the selection of a text database field type. Figure 7 shows the text database field 60 being selected, which is a title field 42. Figure 8 shows a user-defined area 61 of the displayed image being created on the screen. The program then accepts this user-defined area 61 of the displayed image on the computer screen 34 for use by the optical character recognition software. The next step is to perform optical character recognition on the user-defined area 61 so as to convert images 62 within the user-defined area 61 into resultant text 64. As can be seen in Figure 9, an enlarged version 66 of the captured image is displayed on the computer screen 34, for ready verification by a user that it is the desired image. The resultant text 64 is then displayed on the computer screen 34, in box 60. In this manner, the resultant text 64 is displayed on the computer screen 34 in correlation with the presentation of the selected database field type. As is best seen in Figure 10, it is sometimes necessary to edit the resultant text presented on the computer screen 34. The term "AFFIDAvrr OF DOCUMENTS" in box 60, which was the original resultant text generated by the optical character recognition software, has been amended to read "AFFIDAVIT OF DOCUMENTS".
[00041] The resultant text is then stored as records 28 in the database 22, as can be best seen in Figure 1. The records 28 stored in the database 22 represent the record history of the database field. In subsequent occurrences of the same database field type, such as the text database field type shown in Figures 7 through 10, the optical character recognition software is optimized according to the appropriate records 28 stored in the database 22.
[00042] The steps discussed above with reference to Figures 7 through 10, are performed on an iterative basis to form the database 22.
[00043] Reference will now be made to Figures 11 through 13, which show the selection of a numeric database field type. Figure 11 shows the numeric database field being selected, which is the monetary value 48. Figure 12 shows a user-defined area 71 of the displayed image being created on the computer screen 34. The program then accepts this user-defined area 71 of the displayed image on the computer screen 34 for use by the optical character recognition software. The next step is to perform optical character recognition on the defined area so as to convert images 72 within the defined area 71 into resultant text 74. As can be seen in Figure 13, an enlarged version 76 of the captured image is displayed on the computer screen 34, for ready verification by a user that it is the desired image. The resultant text is then displayed on the computer screen 34, in box 48. In this manner, the resultant text is displayed on the computer screen 34 in correlation with the presentation of the selected database field type.
[00044] The resultant text is then stored as records 28 in the database 22, as can be best seen in Figure 1. The records 28 stored in the database 22 represent the record history of the database field. In subsequent occurrences of the same database field type, such as the numeric database field type shown in Figures 11 through 13, the optical character recognition software is optimized according to the appropriate records 28 stored in the database 22.
[00045] The steps discussed above with reference to Figures 11 through 13, are performed on an iterative basis to form the database 22.
[00046] As can be best seen in Figure 14, the present invention also permits creation of an edit history 80 of the database field, on an iterative basis with the other steps described above, as editing of the resultant text is done. This edit history 80 is preferably stored as a text file associated with the file and the database 22. The optical character recognition software may also be optimized according to the edit history.
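An edit history kept as a text file could be as simple as one line per correction. The sketch below is an assumption about the layout (tab-separated field name, text before, text after) and about how the history might be reused; the function names and the file name edit_history.txt are hypothetical, and values containing tabs or line breaks are not handled.

```python
import os
import tempfile
from typing import List, Tuple


def log_edit(path: str, field_name: str, before: str, after: str) -> None:
    """Append one correction to the history as a tab-separated line."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(f"{field_name}\t{before}\t{after}\n")


def load_edits(path: str, field_name: str) -> List[Tuple[str, str]]:
    """Return the (before, after) pairs recorded for one database field."""
    pairs = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            name, before, after = line.rstrip("\n").split("\t")
            if name == field_name:
                pairs.append((before, after))
    return pairs


def apply_known_edits(text: str, edits: List[Tuple[str, str]]) -> str:
    """Reuse a past correction when the same raw OCR output appears again."""
    for before, after in edits:
        if text == before:
            return after
    return text


# Use a temporary directory so the sketch leaves the working directory alone.
history_path = os.path.join(tempfile.mkdtemp(), "edit_history.txt")
log_edit(history_path, "title", "AFFIDAvrr OF DOCUMENTS", "AFFIDAVIT OF DOCUMENTS")

title_edits = load_edits(history_path, "title")
fixed = apply_known_edits("AFFIDAvrr OF DOCUMENTS", title_edits)
print(fixed)
```

Exact-match reuse is the simplest possible use of the history; learning character-level substitutions from the before and after pairs would be a natural extension, but the source does not describe one.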
[00047] Reference will now be made to Figure 15, which shows an alternative embodiment of the method of producing a database having input from a scanned document according to the present invention, as indicated by general reference numeral 120. In the alternative embodiment method of producing a database having input from a scanned document, as is seen in Figure 15, an optional list, as indicated by general reference numeral 122, of words similar to a chosen word from the resultant text can be presented on the computer screen 34. These similar words are to assist in choosing the proper word for the resultant text, and may also preclude a user from having to type in corrections. This list of words is related to the selected database field, so as to provide a high degree of accuracy. These similar words have either been previously entered into the appropriate database field or have been entered in the appropriate database field by the software program embodying the present invention. These words may also be stored in a list associated with the appropriate database field, for the purpose of ready review by a user. These similar words are displayed on the computer screen for selection by a user, via a drop-down list.
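The list of similar words could be produced by fuzzy matching against the words already associated with the selected field. The sketch below uses get_close_matches from Python's standard difflib module; the function name similar_words, the sample word list and the cutoff value are assumptions, and the source does not state which similarity measure is used.

```python
import difflib
from typing import List


def similar_words(chosen: str, field_words: List[str], limit: int = 5) -> List[str]:
    """Return up to `limit` stored words that resemble the chosen word."""
    # cutoff=0.6 is difflib's default similarity threshold; best matches first.
    return difflib.get_close_matches(chosen, field_words, n=limit, cutoff=0.6)


# Hypothetical words previously entered into an author field.
author_words = ["Schwartz", "Schwarz", "Stewart", "Doe"]

suggestions = similar_words("Schwartx", author_words)
print(suggestions)
```

Restricting the candidates to words from the selected field, as the description suggests, keeps the list short and relevant, so a user can pick a correction from the drop-down list without typing.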
[00048] As can be understood from the above description and from the accompanying drawings, the present invention provides a method of producing a database having input from a scanned document, a method of producing a database having input from a scanned document, wherein the database is populated by scanning of documents, a method of producing a database having input from a scanned document, wherein the database is populated by means of optical character recognition of selected portions of documents, a method of producing a database having input from a scanned document, wherein the optical character recognition engine is optimized according to a database field type relating to a selected portion of a document, a method of producing a database having input from a scanned document, wherein the optical character recognition engine is optimized according to the records of the database, a method of producing a database having input from a scanned document, wherein the optical character recognition engine is optimized according to the records of the database, which records represent the record history of the database, and a method of producing a database having input from a scanned document, wherein the optical character recognition engine is optimized according to the edit history of the database, all of which features are unknown in the prior art.
[00049] Other variations of the above principles will be apparent to those who are knowledgeable in the field of the invention, and such variations are considered to be within the scope of the present invention. Further, other modifications and alterations may be used in the design and implementation of the method of producing a database having input from a scanned document according to the present invention without departing from the spirit and scope of the accompanying claims.

Claims (15)

I CLAIM:
1. A method of producing a database having input from a scanned document, said method comprising the steps of:
(a) performing a preliminary scan of said document to thereby produce a digital image stored in computer memory;
(b) displaying the scanned image on said computer screen;
(c) presenting on said computer screen a plurality of database field types associated with said database, for subsequent selection of one of said database field types by a user;
(d) retrieving the properties of the selected database field type;
(e) optimizing optical character recognition software according to the properties of the selected database field;

(f) accepting a user-defined area of the displayed image on said screen;
(g) performing optical character recognition on the defined area so as to convert images within the defined area into resultant text;
(h) displaying the resultant text; and,
(i) storing the resultant text as records in said database;
wherein steps (d) through (i) are performed on an iterative basis to form said database.
2. The method of claim 1, further comprising the step of:
(e') optimizing said optical character recognition software according to records stored in said database;
wherein step (e') is performed on an iterative basis with steps (d) through (i) to form said database.
3. The method of claim 2, wherein said records stored in said database represent the record history of said database field.
4. The method of claim 1, further comprising the step of:
(j) creating an edit history of said database field.
5. The method of claim 4, wherein step (j) is performed on an iterative basis with steps (d) through (i) to form said database.
6. The method of claim 4, wherein said edit history comprises a text file.
7. The method of claim 4, further comprising the step of:
(e'') optimizing said optical character recognition software according to said edit history;

wherein step (e'') is performed on an iterative basis with steps (d) through (i) to form said database.
8. The method of claim 1, wherein said type of database field is a numeric database field.
9. The method of claim 1, wherein said type of database field is a date database field.
10. The method of claim 1, wherein said type of database field is a text database field.
11. The method of claim 1, wherein said resultant text is presented on said screen for editing.
12. The method of claim 1, further comprising the step of:
(h') presenting on said computer screen a list of words similar to a chosen word from said resultant text;

wherein step (h') is performed on an iterative basis with steps (d) through (i) to form said database.
13. The method of claim 12, wherein said list of words is related to the selected database field.
14. The method of claim 1, further comprising the step of:
(b') presenting a list of document numbers for selection by a user.
15. The method of claim 1, wherein in step (h), the resultant text is displayed on said computer screen in correlation with the presentation of the selected database field type.
CA002427468A 2003-05-02 2003-05-02 Method of producing a database having input from a scanned document Abandoned CA2427468A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CA002427468A CA2427468A1 (en) 2003-05-02 2003-05-02 Method of producing a database having input from a scanned document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CA002427468A CA2427468A1 (en) 2003-05-02 2003-05-02 Method of producing a database having input from a scanned document

Publications (1)

Publication Number Publication Date
CA2427468A1 (en) 2004-11-02

Family

ID=33315225

Family Applications (1)

Application Number Title Priority Date Filing Date
CA002427468A Abandoned CA2427468A1 (en) 2003-05-02 2003-05-02 Method of producing a database having input from a scanned document

Country Status (1)

Country Link
CA (1) CA2427468A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8682075B2 (en) 2010-12-28 2014-03-25 Hewlett-Packard Development Company, L.P. Removing character from text in non-image form where location of character in image of text falls outside of valid content boundary

Similar Documents

Publication Publication Date Title
US8139870B2 (en) Image processing apparatus, recording medium, computer data signal, and image processing method
US9092417B2 (en) Systems and methods for extracting data from a document in an electronic format
CN100414549C (en) Image search system, image search method, and storage medium
US7636886B2 (en) System and method for grouping and organizing pages of an electronic document into pre-defined categories
US6031625A (en) System for data extraction from a print data stream
JP3425408B2 (en) Document reading device
US20030042319A1 (en) Automatic and semi-automatic index generation for raster documents
US9558234B1 (en) Automatic metadata identification
US8565526B2 (en) Method and system for converting image text documents in bit-mapped formats to searchable text and for searching the searchable text
US20050171965A1 (en) Contents reuse management apparatus and contents reuse support apparatus
US7853595B2 (en) Method and apparatus for creating a tool for generating an index for a document
US5563997A (en) Method and apparatus for sorting records into a list box in a graphic user interface
US5895473A (en) System for extracting text from CAD files
US20100217717A1 (en) System and method for organizing and presenting evidence relevant to a set of statements
JP2006091994A (en) Device, method and program for processing document information
US8612431B2 (en) Multi-part record searches
EP1256900A1 (en) Database entry system and method employing optical character recognition
CA2427468A1 (en) Method of producing a database having input from a scanned document
JPH117452A (en) Method and device for collecting information through network and recording medium recording program for executing the method
JP3335863B2 (en) Apparatus and method for simplifying character input
JP7377565B2 (en) Drawing search device, drawing database construction device, drawing search system, drawing search method, and program
JP2003058559A (en) Document classification method, retrieval method, classification system, and retrieval system
US20040164989A1 (en) Method and apparatus for disclosing information, and medium for recording information disclosure program
KR101057997B1 (en) Search engines and search methods using initial text
JP2009098764A (en) Action management system and action management method

Legal Events

Date Code Title Description
FZDE Discontinued