WO2007050372A2 - Document recognition method - Google Patents
Document recognition method
- Publication number
- WO2007050372A2 (application PCT/US2006/040619)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- document
- content
- word
- immutable
- word list
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
Definitions
- the present invention relates to delivery of documents and data from and between client computers and a server and a computer network.
- Facsimiles are widely used at the present time for distribution of simple documents, but facsimile transmission has numerous
- facsimile documents are stored in a raster
- facsimile transmissions take a relatively long time to complete, particularly for
- Facsimile transmissions may be
- Electronic mail permits a document to be transmitted
- documents delivered via electronic mail may be printed at the same quality as they are transmitted, and may be edited by the recipient using the same software application in which they were generated by the sender.
- documents formatted for one application may be converted to another format for use by another application for editing.
- This software or hardware captures a Printer Control Language file generated by the legacy
- the Printer Control Language file may then be sent to a printer at the recipient computer system with the same quality as can be achieved at the sender.
- Printer Control Language files are not word processing formats
- Printer Control Language files cannot be readily edited by the recipient. Furthermore, the recipient cannot readily extract data
- Control Language file delivered in accordance with the methods described above. Specifically, the file can be printed, and can be visually displayed. If the recipient desires to categorize the received file or extract data from it, the
- the purpose is to include identifying
- the cover page is then scanned by a recipient computer system to identify the form and data
- cover page is then removed and the remainder of form is printed or displayed.
- the cover page is presented in a
- the invention provides a method of recognizing the content of a document as part of electronic delivery thereof,
- the immutable content is identified by recognizing graphical boxes included in the document.
- graphical boxes are recognized by dividing said document into rectangles consistently with the positions of graphic lines in the document.
- the documents at issue are home mortgage transaction documents generated from standard
- received i.e., it is in the form of a PCL, PostScript, PDF or raster image.
- generating a word list for the document involves matching a character map embedded within the document to graphical content of the
- the document can be recognized as a relative of the known document
- the mutable content of the document may be identified based upon the
- documents descriptive of form may be generated by capturing the word list for a "blank" of each form in use. More robustly, known document word lists may
- the common subsequences can be generated by a process that identifies common word subsequences in a plurality of documents using the same form.
- Fig. 1 is a diagram of a network of computers including a
- Fig. 2A is an illustration of a HUD-1 settlement statement
- Fig. 2B is an illustration of a sequence of form documents
- Fig. 3 is a process for identifying mutable and immutable text on a primarily graphically defined document and generating a data file representing the data on the form;
- Fig. 4 is a process for identifying rectangles identifying sections
- Fig. 4A is an illustration of horizontal correction of line intersections in the process of Fig. 4;
- Fig. 4B is an illustration of vertical correction of line intersections in the process of Fig. 4;
- Fig. 4C is an illustration of exemplary documents containing
- Fig. 5 is an illustration of a page matching process for
- Fig. 6 is an illustration of an enhanced page matching process for identifying a form from text thereon and extracting data from the form.
- FIG. 1 a network of computers for carrying out the principles of the present invention can be described. At the heart of this
- computer network is a server 10 which operates in conjunction with a mass
- mass storage 11 includes received documents (which are typically filled-out forms and will be identified from time to time in the following as "forms"), template
- Server 10 interacts with a plurality of remotely located client computers via a network
- Fig. 1 the clients illustrated in Fig. 1 are those typically involved with the real estate purchase transaction, although
- real estate purchase/refinance and mortgage forms are initially generated by a lender bank's home office, at a
- lender bank home office computer 12 may be a legacy computer system, such as a mainframe computer
- Forms generated by client computers, such as the bank home office computer 12, may take a variety of forms.
- Forms managed by the system may be highly graphical forms 14, including graphical features such as
- HUD-1 form promulgated by the United States Department of Housing and Urban Development, which is used
- the HUD-1 form includes graphical boxes and highly formatted arrangements of information to present the purchaser, seller and lender details, and financing details on a real estate purchase
- a box includes an identifying label such as "D. NAME AND ADDRESS OF BORROWER" and textual content that corresponds to that label.
- the position of data for the form is typically consistent with the labels, but there may be special cases. For example, there is no box that is explicitly labeled to include the seller's Social Security number, but this
- box 36 which bears the label "G. PROPERTY LOCATION".
- box 38 in some cases this information is positioned in box 38 which bears the label "E. NAME AND ADDRESS OF SELLER".
- box 40 is labeled "I.
- this box may list both the settlement date and the
- data is presented in a box that is adjacent to the box that labels that data.
- the financial institution is presented in a box that is adjacent to the box that labels that data.
- Contract Sales Price does not itself identify the contract sales price, but rather
- the contract sales price is identified in a box immediately adjacent and to the right thereof. The same pattern repeats elsewhere in area 42 of the form.
- forms may include lengthy text, e.g., legal recitations, which must be acknowledged by a buyer's or seller's or
- lender's signature such as represented at 16.
- Forms of this type are less elaborately formatted than HUD-1 type forms 14, but such forms nevertheless
- Such forms typically include a blank for signature by
- a third type of form includes even fewer customized fields, such as a disclosure document for providing information regarding a transaction to, typically, the buyer of real estate. Such forms are represented at 18, and may include customized content relating to the transaction, such as the buyer's or seller's name or the purchase price,
- a PC desktop program is used to capture the print stream from a legacy computer system and package that print stream for uploading to server 10, as described in U.S. Patent Application 10/702,204
- the print stream from the legacy computer system may include several sequential pages 46a, 46b, 46c, which may collectively constitute a single form, or may represent multiple forms.
- Forms generated by an originating client computer may be delivered to the server 10 in a variety of potential formats.
- PCL includes
- Printer Control Language is
- PostScript printer driver output may be transmitted from client computer 12 to server 10, and then utilized to deliver form documents to other client computers.
- PostScript is a page description language that differs from PCL in a number of manners, and is somewhat less
- PostScript has wide enough acceptance that PostScript drivers are available for most computer systems and most software packages used with those computer systems. In the event that such drivers are
- PCL may be converted to PostScript, using library utilities such as "JETPCL".
- PostScript is an advantageous format for presenting graphics as compared to Printer Control Language. Furthermore, a derivative of
- PostScript is utilized by the portable document format (PDF) popularized by Adobe Systems and used with the Adobe programs known as Acrobat and
- PostScript formatted forms therefore, are more readily convertible to Adobe Acrobat format for delivery in PDF form, using library
- PostScript may be more readily utilized to extract text from a document
- PostScript formatted documents include a character encoding table that readily permits parsing of the document to identify the graphics therein
- Printer Control Language also includes a character encoding table, character encoding in PCL is typically
- PostScript character encoding may be more easily managed, a PostScript document may be more readily scanned to determine all of the characters utilized in that document.
- Such a scanning function is known to those of skill in the
- the text on a form may be extracted and converted to a word list, with each word having a page location, permitting processing of those words to determine words that are customized form data and also permitting those words which are not customized form data to be matched against templates to
- server 10 determines which form is being utilized. Returning to Fig. 1, when server 10 receives form content such as a form 14, 16 or 18, server 10 stores those received forms in mass storage
- Forms may be received in PostScript format, or the reception by server 10 may involve conversion of forms received in PCL format to PostScript. Then,
- server 10 processes those forms utilizing a program in server 10 and/or template data in mass storage 11, as described in the various embodiments reviewed below.
- the outcome of this processing is an identification of the
- This data is stored in mass storage device 11 in a retrievable data format such as an extensible
- MIMO Standards Management
- an attorney involved in the transaction working in an attorney computer 22 may desire to download an XML / MISMO SmartDoc version of
- an attorney may wish to download the HUD-1 form 14 itself for printing at the attorney's computer 22. Accordingly the attorney may download a version 14' of the HUD-1 form that has been
- server 10 such as a version converted from PCL to PostScript format, and then to portable document format or PDF.
- XML or MISMO SmartDoc formatted documents such as 20 including financial details of the transaction or name and address
- a buyer utilizing a buyer's computer 26 may receive some or all of the forms generated by the bank home office 12, in this case including not only the HUD-1 form 14 but also a mortgage paper or
- a first step 100 in this process is to generate a word list for the
- a word list may take a number of forms depending upon the format in which the document is made available. For example, a PostScript formatted document may be decoded in step 102 by utilizing the
- the document may be a raster graphic image such as a facsimile
- character recognition techniques 104 may be applied to the document to identify characters, and then
- format may include a character table, that may be decoded to identify words and characters therein, although the methodology utilized may differ from the
- the form recognition algorithm identifies the immutable text and graphic content in the form. This step may take a variety of possible forms depending
- the process of locating immutable text in the form includes recognition of the form from among a plurality of candidates.
- the form in use is known in advance, and processing may
- the process illustrated in Fig. 3 includes an optional step 108 of recognizing the form based upon its immutable content, which is performed only in those
- the mutable content of the form i.e., the information filled in the blanks of the form, is identified, and its meaning is determined based upon its position relative to the immutable content.
- Various embodiments of the invention perform this step in different ways. For
- the mutable text may be recognized by its proximity and positioning
- mutable text may be recognized from its position
- the meaning of mutable content may be determined based upon its positioning in the form, or any other relationship it has with immutable content.
- the meaning of the mutable content may be determined from the text
- a data file such as an XML, MISMO
- SmartDoc, or other data file is stored containing the mutable text and the
- variable names identifying the meanings of the mutable text, for later use as discussed above with reference to Fig. 1.
- a rectangle recognition algorithm 120 is utilized to divide the surface of the form into rectangles, which are then used for recognition of text and the meaning of that text on the form. Prior to recognizing rectangles, the algorithm of Fig. 4 corrects the positioning of
- the algorithm will use the midline 124 of a horizontal line and the midline 126 of a vertical line. It will be appreciated, however, that
- graphically defined lines include a finite width as seen at 125 and 127.
- the finite width of the graphically defined line adds a potential ambiguity and
- midline 124 is defined to overlap with line 126, within the width 127 of the printed appearance of line 126.
- midline 124 but it does not intersect at a T-intersection with midline 126.
- the overlap of midline 124 with midline 126 will not be visible on a printed
- midline 124' does not intersect midline 126' even though the printed width of line 124', as seen at 125', will overlap with the printed width of line 126', as seen at 127'.
- a vertical line 132 having a printed width shown at 133 may not properly intersect with a horizontal line 134 having a printed width 135.
- line 132 may extend beyond line 134 rather than forming a proper T-intersection, and again,
- Fig. 4A must be corrected prior to identifying rectangles on a form, to
- step 122 a horizontal correction is performed at each line intersection.
- each vertical line is extended by a
- each horizontal line is analyzed to determine whether its vertical position is within the vertical range of the
- the rectangle recognition algorithm of Fig. 4 also includes step
- each vertical line is evaluated to determine whether it is close to forming a T-intersection with a horizontal line
- the vertical line is extended or shortened so that the
- each horizontal line is extended by a
- each vertical line is analyzed to determine whether its horizontal position is within the horizontal range of the
- the top of the vertical line is set equal to the vertical position of the horizontal line, and, if the bottom of the
- the end point of the vertical line is changed to equal the vertical position of the horizontal line, so that the end point of the vertical line exactly corresponds to
- step 142 the initial vertical position, at the beginning of the process, is set in step 142 to be
- step 144 a data structure is created to
- Open rectangles are defined by a data structure identifying the left and right edges of the rectangle, and the vertical position at which the rectangle starts.
- margins are treated as rectangle boundaries so that each page will contain at least one rectangle. Boxes within the page will create further rectangles, as discussed below.
- step 145 it is determined whether any non-checkbox vertical lines end (i.e., have their bottom end) at the current vertical position on the form.
- Non-checkbox vertical lines are vertical lines that are not part of a checkbox, i.e.,
- box 32 (Checkboxes are rectangles or squares that have a predetermined maximum size, and are identified in this manner, and ignored during rectangle recognition.) If a non-checkbox vertical line ends at the current vertical
- step 146 the open rectangles that join at
- one rectangle is opened, by forming data structures representing the rectangles in step 147.
- step 147 has its respective right and left edges at the horizontal positions of the left and right edges, respectively, of the closed rectangles to the left of and to the right of the ending vertical line.
- the ending of a vertical line will merge open rectangles by replacing them with one open rectangle beginning at
- processing After thus creating a new rectangle at the ending position of the vertical line, processing returns to step 145 at which it is determined whether there is another vertical line ending at the current vertical position. If so, then
- step 148 any non-checkbox vertical lines that start (i.e., have their top end) at the current vertical position on the form.
- step 149 the open rectangle that includes the
- step 150 have their respective right and left edges, respectively, at the horizontal position of the vertical line, and have their left and right edges, respectively, at the left and right edges of the open rectangle that was closed in
- step 149 the presence of a vertical line will split an open rectangle by
- processing After thus creating new rectangles at the position of the starting vertical line, processing returns to step 148 at which it is determined whether
- step 145 When all vertical lines ending and starting at the current vertical position have been processed through steps 145 and 148, then in step
- step 154 it is determined whether there is a
- step 146 in which any vertical lines at the current vertical position are assessed to potentially break the rectangle created in step 144 into smaller rectangles reflecting the presence
- step 154 if it is determined that there is no horizontal line at the current vertical position, then processing proceeds
- step 154 directly from step 154 to step 146 to evaluate whether there are vertical lines at the current vertical position that will require division of currently opened rectangles into smaller rectangles.
- step 152 At this point, there are no additional vertical positions to evaluate in the form, and the process of Fig. 4 proceeds to step 160.
- Fig. 4C provides exemplary illustrations of the manner in
- Form 158 illustrated in Fig. 4C includes a single box generally
- This form 158 will be divided by the process of Fig. 4 into five rectangles.
- a first rectangle 158-1 is defined in the area extending from the left, right and top margin to the horizontal line that defines the top of the box in the form 158.
- rectangles 158-2, 158-3 and 158-4 are defined in the horizontal region where the box is positioned on the form, one to the left of, one to the right of, and one corresponding to the box. Finally, a fifth rectangle 158-5 is defined for the area extending from the horizontal line defining the bottom of the box, to
- Form 159 is a more elaborate form that includes 14 adjacent boxes of irregular sizes. This form will be divided by the process of Fig. 4
- Step 160 and the following steps recognize and associate text in the form with fields, on a rectangle by rectangle basis, thus capturing the data
- step 162 a rectangle of the form is
- the rectangle may include the known field name "D. NAME AND ADDRESS OF BORROWER:". If the rectangle is thereby identified, the rectangle may also contain mutable text which identifies the values for a field.
- immutable text in a rectangle may not always be simple - a social security
- a disbursement date may be included in a rectangle having the immutable text "G. PROPERTY LOCATION" and a disbursement date may be included in a
- steps 160 and the following must be designed to flexibly determine the mutable text once the immutable text has been identified, by for example
- step 164 processing continues to step 164 and then to step 166 in which any text within
- mutable text e.g., formatted as a social security number
- a rectangle may not contain mutable text, but may include a reference to an adjacent rectangle containing mutable text.
- the rectangle may contain the immutable text "101. Contract Sales Price", and be adjacent to another rectangle which contains the dollar figure
- step 168 processing continues through step 168 to step 170 in which the text from the appropriate adjacent rectangle, selected based upon the immutable text, is extracted and assigned to the appropriate variable, e.g. the variable that is identified by the immutable text in the current
- Some rectangles may contain check boxes, and the processing of such rectangles in steps 164 and 166 is slightly different. In this case the
- mutable text is analyzed to determine whether there is mutable text or graphic
- checkboxes if so, then the immutable text adjacent to that checkbox is used to identify the meaning of the selected checkbox(es).
- rectangles may not contain any text or may not contain mutable text which is associated with a value.
- steps 162 through 170 is performed for each
- immutable elements and mutable text in those rectangles using the immutable text as a guide to the location and meaning of the mutable text. While this is,
- word list from an unknown received document, and compares it to the word
- nonmatching (mutable) words to determine which of the known forms is the closest match, and identify the meaning of the mutable and immutable text and data on the form.
- a word list, to be identified as "TEST", is extracted from the document to be matched.
- this step typically involves using a library function that uses the character map in the document
- format documents may be converted to a word list without requiring recognition of characters. Fax-formatted or other graphics formatted
- OCR optical character recognition
- incoming documents may be multi-page, and those multiple pages
- the wordlist (i.e., "TEST") extracted from the incoming document may be initially generated from only the first page of a multi-page print stream or document. If a word list extracted only from the first page
- a word list extracted from the first two pages may be compared to known forms, and so on. Once a set of pages in an incoming document is successfully matched to a known form, then the remaining pages may be processed by the same procedure, starting with a
- this wordlist is compared to wordlists from each of several candidate forms that may be matchable to the incoming document.
- wordlists are generated in advance, for example by causing the source
- a "blank" form, i.e., a form that contains all of the immutable content of a form, but does not contain any mutable content, and then converting the resulting "blank" form to a wordlist.
- known wordlists may be generated by causing a
- client system to output wordlists for multiple versions of a given form (i.e.,
- the server 10 may retain several recent versions of a recognized form, e.g., as those forms are forwarded through server 10, and periodically use the retained recent versions to rebuild a known wordlist representing the largest common subsequence of words in those forms, which can replace the known wordlist currently in use if any changes are noted.
- step 202 is performed one wordlist at a time. Specifically, in step 202,
- a known wordlist "KNOWN" that has not previously been evaluated is selected from the available pool of known wordlists.
- pointer variables x and y are initialized to zero and a temporary file is created for storing a matched version of the incoming wordlist TEST. Then, a loop of
- steps 206-214 is entered, which collectively compare words in TEST to words
- step 206 a current word in TEST (at position x) is compared to a current word in KNOWN (at position y). If these two current words
- step 208 the pointers x and y are incremented (skipping to the next words in TEST and KNOWN), and the current word in TEST is stored in the temporary file as a matched, apparently immutable word. If, however, the current words do not match, then in step 210 the pointer x is incremented (skipping to the next word in TEST), and the current word in TEST is stored in the temporary file as an unmatched, apparently mutable word. Thereafter, in steps 212 and 214 it is evaluated whether the end of the TEST or KNOWN wordlist has been reached, respectively, and if not, processing returns to step
- processing will proceed from either step 212 or 214, respectively, to step 216.
- the pointer y will identify the number of words in
- step 218 the temporary file storing the matched version of TEST, is stored as a current best match. In any case, in step 220, it is determined
- step 222 the temporary file is erased for the next iteration, and processing
- step 220 After all candidate known wordlists have been evaluated, processing continues from step 220 to step 224, in which the best match file
- the matched wordlist include a limited number of unmatched / apparently mutable words, or include matching for all or
- processing continues to step 226, and the mutable text is extracted from the best match file, and its meaning recognized based upon its proximity to immutable text, in a manner analogous to the processing
- step 228 a matching failure is returned in step 228. This may end the matching process, or as discussed above may cause the matching
- the wordlist to be tested includes the following:
- immutable word is followed in the test word list by a mutable word "Tim"
- each known word list includes the immutable word "Ext.", which is matched by the process of Fig. 5 to the word "Ext.:" in the TEST word list.
- word list includes the mutable word "x123" after "Ext." (which would be implied to be a telephone extension). Furthermore, it can be seen that the third
- test word list After "Title" the TEST word list includes "Programmer
- the matching algorithm is capable of identifying mutable and immutable text and permitting
- forms typically include all of the immutable content of the "blank" form, plus additional mutable content, it is typical for forms to be updated from time to
- this updating may involve deleting immutable content from the
- a field may be deleted or the immutable words in that field may be simplified by deletion of words.
- a "blank" form rather than being blank, may include placeholder words at the
- placeholder words e.g., unique gibberish words such as zzyzz, zyzzz, etc.
- This approach might help to prevent immutable words of a known document from being confused with mutable words of a received document - for example, the last name of the borrower identified on a mortgage transaction form might be "Borrower", a mutable
- each table has a number of rows equal to the number of words in TEST (which is stored as "m" in step 254), and a number of columns equal to the number of words in KNOWN (which is stored as "n" in
- step 254 The first table, known as bTABLE (which is dimensioned in step
- cTABLE also dimensioned in step 256
- step 258 the cTABLE is initialized by setting the values in its 0th row and 0th column to 0 (representing that no common subsequences are
- step 260 the loop variables i and j are
- a first step 262 it is determined whether the current word in
- TEST (at position i) is the same as the current word in KNOWN (at position j). If so, then any common subsequence that has previously been identified in
- step 264 the value in cTABLE at (i-1, j-1), which represents the length of
- step 262 In the event that word i of TEST and word j of KNOWN do not match in step 262, processing continues to step 266, in which a test is
- step 266 the value of
- step 268 the value of cTABLE at entry (i-1, j) is stored in cTABLE at the entry (i, j), thus reflecting that the longest
- subsequence has the same length after word i of TEST as before word i of TEST (because word i of TEST did not match word j of KNOWN). Also in
- step 268 the symbol "↑" is placed in bTABLE at entry (i, j), reflecting that the longest known subsequence at word i of TEST and word j of KNOWN
- step 266 If the comparison of step 266 is false, this indicates that the
- step 270 the value of cTABLE at entry (i, j-1) is stored in cTABLE at entry (i, j), thus reflecting that the
- step 270 the symbol "←" is placed in bTABLE at entry (i, j), reflecting that the longest known subsequence at word i of TEST and
- word j of KNOWN continues from word i of TEST and word j-1 of KNOWN.
- step 272 it is determined whether the end of the TEST wordlist has been reached. If not, then the pointer i is incremented in step 274, to begin comparison of the next word of TEST with the current word of KNOWN in the above-described manner, and processing
- step 276 it is determined whether the end of the KNOWN wordlist has been reached. If not, then the pointer j is incremented in step 278, and the pointer i is reset to 1, to begin comparison of the first word
- step 280 the cTABLE and bTABLE are evaluated to determine the quality of the match between TEST and the currently selected KNOWN wordlist. Specifically, the largest value in the cTABLE represents
- step 280 the largest value in the cTABLE is larger than the
- step 282 the cTABLE and bTABLE are stored as the current best match.
- step 284 it is determined whether there are further known wordlists to compare to the TEST wordlist, and if so,
- processing returns to step 252 to select a remaining known wordlist for
- step 284 processing continues from step 284 to step 286, in which the best match
- the matched wordlist include a limited number of unmatched / apparently mutable words, or include matching for all or substantially all of the words in the known wordlist. If these criteria indicate a
- step 290 a matching failure is returned in step 290. This may end the matching process, or as discussed above may cause the matching
- the wordlist to be tested includes the following:
- TEST Name: Tim Ext.: x123 Title: Programmer Spvsr.: Phil
- Three known wordlists are matched to this test word list, as follows:
- KNOWN #2 Name: Ext.: Fax: Home: Title: Sprvsr.:
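The enhanced page matching of Fig. 6, with its cTABLE of subsequence lengths and bTABLE of direction symbols, reads like the textbook longest-common-subsequence table construction. The following Python sketch is written under that assumption and is not the patent's implementation. The function names, the string markers used in place of arrow symbols, and the tie-breaking comparison (the text describing the test of step 266 is truncated above) are all illustrative assumptions. The example data are the TEST word list and the KNOWN #2 word list quoted above, reading the telephone extension as x123.

```python
from typing import List, Tuple


def lcs_tables(test: List[str], known: List[str]) -> Tuple[List[List[int]], List[List[str]]]:
    """Build cTABLE (lengths) and bTABLE (back-pointers) for TEST vs. KNOWN."""
    m, n = len(test), len(known)
    # Row 0 and column 0 stay 0: no common subsequence with an empty prefix.
    c = [[0] * (n + 1) for _ in range(m + 1)]
    b = [[""] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):          # outer loop over words of KNOWN
        for i in range(1, m + 1):      # inner loop over words of TEST
            if test[i - 1] == known[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
                b[i][j] = "diag"       # both words extend the subsequence
            elif c[i - 1][j] >= c[i][j - 1]:   # assumed tie-break (textbook form)
                c[i][j] = c[i - 1][j]
                b[i][j] = "up"         # continue from word i-1 of TEST
            else:
                c[i][j] = c[i][j - 1]
                b[i][j] = "left"       # continue from word j-1 of KNOWN
    return c, b


def split_words(test: List[str], known: List[str]) -> Tuple[List[str], List[str], int]:
    """Walk bTABLE backwards; return (immutable, mutable, match length)."""
    c, b = lcs_tables(test, known)
    i, j = len(test), len(known)
    immutable: List[str] = []
    mutable: List[str] = []
    while i > 0:
        if j > 0 and b[i][j] == "diag":
            immutable.append(test[i - 1])
            i -= 1
            j -= 1
        elif j > 0 and b[i][j] == "left":
            j -= 1                     # KNOWN word absent from TEST; skip it
        else:
            mutable.append(test[i - 1])  # TEST word not in the subsequence
            i -= 1
    return immutable[::-1], mutable[::-1], c[len(test)][len(known)]


TEST = "Name: Tim Ext.: x123 Title: Programmer Spvsr.: Phil".split()
KNOWN_2 = "Name: Ext.: Fax: Home: Title: Sprvsr.:".split()

immutable, mutable, length = split_words(TEST, KNOWN_2)
print(length, immutable, mutable)
```

In this sketch the bottom-right cTABLE entry is the match length that would be compared across candidate known word lists. Because "Spvsr.:" and "Sprvsr.:" are spelled differently in the quoted lists, that label is treated as unmatched here.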
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Character Input (AREA)
- Character Discrimination (AREA)
Abstract
The content of a document is recognized as part of electronic delivery thereof, by generating (100) a word list for the document, recognizing (106) immutable content of the document that corresponds to a form used in generating the document, identifying (112) meanings for mutable content of the document based upon positions of mutable content relative to immutable content, and then storing (114) the mutable content in association with the identified meanings thereof for subsequent retrieval.
Description
ATTORNEY DOCKET NO: ELYN/11
DOCUMENT RECOGNITION METHOD
Field of the Invention
The present invention relates to delivery of documents and data from and between client computers and a server and a computer network.
Background of the Invention
Modern technology has developed a number of methods for
delivering documents between a sender and a recipient that are alternative to
the traditional physical delivery of paper copies. One older example is the use of facsimile technology. Facsimiles are widely used at the present time for distribution of simple documents, but facsimile transmission has numerous
drawbacks. First, the quality of the printed document at the recipient is low,
and clearly a reduction from the original. This may result in a loss of content or at least readability. Furthermore, facsimile documents are stored in a raster
graphic form and cannot be easily edited. For example, unless a complex conversion is performed, the recipient cannot change the text with a text
editor, or move the graphics on the document, and the like. Furthermore, facsimile transmissions take a relatively long time to complete, particularly for
long, elaborate and complex documents. Facsimile transmissions may be
stored electronically so that they may be preserved, printed multiple times, or forwarded electronically, but such uses of facsimile transmission data do not
overcome the difficulties of quality, presentation and time of transmission that are typical of facsimile methods.
Another widely used technology for distributing documents is
electronic mail. Electronic mail permits a document to be transmitted
electronically from one computer to another, and offers the advantages of convenience, electronic storage, and delivery of a document in its native electronic
format. As a consequence, documents delivered via electronic mail may be printed at the same quality as they are transmitted, and may be edited by the recipient using the same software application in which they were generated by the sender.
Although electronic mail is thus preferable in many ways to
facsimile transmission, electronic mail has drawbacks of its own. Specifically,
electronic mail is not secure, and it is not ideal for transmitting very long documents, which may exceed an email server's size limitations. Furthermore,
the need for compatible software at the recipient of an electronic mail can be a formidable challenge to the recipient's use of the document. In some cases
documents formatted for one application may be converted to another format for use by another application for editing. There are also document formats available which are relatively platform independent, such as the portable
document format (PDF) promulgated by Adobe Systems, which is based upon
the PostScript page description language and defines the content of a
document graphically. However, a recipient wishing to extract data from a
received document that is in an unusual format may be unable to extract information therefrom.
In many business contexts, documents are generated by legacy
computer systems, in a format that is older than and incompatible with modern word processing systems. Files formatted for legacy computer systems of this type typically cannot be utilized by more modern computer systems and thus are not able to be usefully transmitted by electronic mail.
To manage these incompatibilities, software has been developed for retrofitting legacy computer systems for electronic document
delivery. Specifically, software or hardware is provided that emulates a printer
receiving a document from the legacy computer system. This software or hardware captures a Printer Control Language file generated by the legacy
computer system, which is then electronically transmitted from the legacy
computer system to recipient computer systems. The Printer Control Language file may then be sent to a printer at the recipient computer system with the same quality as can be achieved at the sender.
While this approach avoids the incompatibility problems of
legacy computer systems, it still suffers from various inherent problems.
Specifically, Printer Control Language files are not word processing formats,
and typically represent content of a document, including text, in a graphical form. As a consequence, Printer Control Language files cannot be readily
edited by the recipient. Furthermore, the recipient cannot readily extract data
from a Printer Control Language file, as compared to files formatted in modern
word processing formats. Thus, the recipient computer system cannot readily identify the nature of the document represented by a Printer Control Language
file, determine whether it is a form, and what particular type of form is being used, or extract data from that form, or distinguish the text that represents the
content of the form from graphical content or the text that defines the form.
This means that a recipient computer system has only limited use for a Printer
Control Language file delivered in accordance with the methods described above. Specifically, the file can be printed, and can be visually displayed. If the recipient desires to categorize the received file or extract data from it, the
recipient typically needs to manually review the content of the file in a printout
or on the screen to identify its content and manually extract data.
As a result of these difficulties with existing Printer Control Language based document sharing, it has been known to modify legacy
computer systems to produce cover pages or other simplified content
preceding a printed document. The purpose is to include identifying
information and readily identifiable data relating to the document, in a
simplified cover page rather than within the form itself. The cover page is then scanned by a recipient computer system to identify the form and data
relating to the document. The cover page is then removed and the remainder
of form is printed or displayed. Typically, the cover page is presented in a
very simple format that facilitates scanning of its corresponding Printer Control Language expression and extraction of content. Unfortunately, this approach suffers from the drawback that the legacy computer system must be modified so that each form contains a cover page. Such modifications are
cumbersome and may be difficult to achieve, particularly if there is a need to maintain compatibility with conventional uses of the legacy computer system.
Furthermore, customized software must be developed to capture data and form identification from cover pages, which may need to be unique to each computer system at issue.
Therefore, there is a need to process data in a primarily graphical electronic format to identify form content in that format, and extract data from a form stored in that format, without requiring modification of the form or inclusion of extraneous data in the form. There is further a need for
automatic identification of text and the meaning thereof within primarily
graphically described forms. Finally, there is a need to manage multiple printed forms which may be generated by a sending computer system, to identify
those forms and determine which of several potential forms is being presented
by a graphically defined electronic file.
Summary of the Invention
These needs are met by the invention, which provides a method of recognizing the content of a document as part of electronic delivery thereof,
by generating a word list for the document, recognizing immutable content of the document that corresponds to a form used in generating the document, identifying meanings for mutable content of the document based upon positions of mutable content relative to immutable content, and then storing
the mutable content in association with the identified meanings thereof for
subsequent retrieval.
In one specific embodiment described below, the immutable content is identified by recognizing graphical boxes included in the document.
More specifically, graphical boxes are recognized by dividing said document into rectangles consistently with the positions of graphic lines in the document.
For efficiency, prior to recognition of graphical boxes, intersections between
horizontal and vertical lines in said document are corrected to create T intersections - to correct cases where a horizontal or vertical line ends near to
but not at a vertical or horizontal line, respectively.
In the specific embodiment described below, the documents at issue are home mortgage transaction documents generated from standard
banking and mortgage forms, and the immutable content of the forms used
includes not only graphical elements, but also words, such as the title of a
form, box content identifiers, legal form paragraphs, and the like. The document is processed to extract from this immutable content, mutable content
such as the borrower's and lender's name and address, and financial terms for a transaction, which are stored in one or more of an XML or MISMO SmartDoc
format for sharing, as data files, between clients of a document delivery service. However, numerous other applications are possible.
In this embodiment, a document is graphically described when
received, i.e., it is in the form of a PCL, PostScript, PDF or raster image.
Accordingly, generating a word list for the document involves matching a character map embedded within the document to graphical content of the
document, performing character recognition upon the graphical content, or other forms of word recognition.
In an alternative embodiment described herein, the immutable
content is in the form of words alone, and is identified by comparison of the word list for a document, to word lists of multiple known documents. In this embodiment, when the document word list is matched to the word list of a
known document, the document can be recognized as a relative of the known
document, typically both being the product of a common legal form. At that
point, the mutable content of the document may be identified based upon the
immutable content and the specifics of the recognized form. Known
documents descriptive of a form may be generated by capturing the word list for
a "blank" of each form in use. More robustly, known document word lists may
be generated by a process that identifies common word subsequences in a plurality of documents using the same form. The common subsequences can
then be used as a wordlist for recognizing other documents created with the same form.
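By way of illustration only, the generation of a known word list from several documents created with the same form may be sketched as follows. The sketch assumes that word lists are plain Python lists of strings and that the common subsequence of more than two documents is approximated by folding a pairwise longest-common-subsequence routine across the documents; the function names `lcs` and `derive_template` and the sample documents are hypothetical and are not taken from the specification.

```python
from functools import reduce


def lcs(a, b):
    """Return one longest common subsequence of word lists a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Walk back through the table to recover the matched words.
    out = []
    i, j = len(a), len(b)
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


def derive_template(documents):
    """Approximate the words common to all documents by folding pairwise LCS.

    Folding is a heuristic: it is not guaranteed to find the longest
    subsequence common to every document, only a common one.
    """
    return reduce(lcs, documents)


# Hypothetical filled-out documents that share one form.
docs = [
    "Name: Tim Ext.: x123 Title: Programmer".split(),
    "Name: Ann Ext.: x456 Title: Manager".split(),
    "Name: Bob Ext.: x789 Title: Analyst".split(),
]
template = derive_template(docs)
```

In this example the words that differ between documents drop out, leaving only the words that every document shares, which can then serve as the known word list for the form.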
A similar process for identifying common word subsequences
may also be used to compare a document to possible forms; the form that has the largest matching common word subsequence can then be determined to be
the form used in creating the document. This process permits recognition of a
document as the product of a form even where there are insertions and deletions between the document and form.
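The selection of a form by largest matching common word subsequence may likewise be sketched, under stated assumptions, as follows. The acceptance threshold `min_fraction`, the function names, and the sample word lists are hypothetical; the specification does not prescribe a particular threshold value.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of word lists a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def best_form(test_words, known_forms, min_fraction=0.8):
    """Return the name of the known form with the largest common subsequence.

    min_fraction is an assumed acceptance criterion: the match is rejected
    unless it covers at least that fraction of the known word list.
    """
    best_name, best_score = None, 0
    for name, words in known_forms.items():
        score = lcs_length(test_words, words)
        if score > best_score:
            best_name, best_score = name, score
    if best_name is None:
        return None
    if best_score < min_fraction * len(known_forms[best_name]):
        return None
    return best_name


# Hypothetical known word lists and a test word list.
known = {
    "directory": "Name: Ext.: Title: Spvsr.:".split(),
    "invoice": "Invoice No.: Date: Amount: Name:".split(),
}
test_words = "Name: Tim Ext.: x123 Title: Programmer Spvsr.: Phil".split()
chosen = best_form(test_words, known)
```

Because a subsequence match skips over words that appear in only one of the two lists, the comparison tolerates inserted data words in the document as well as form words that are missing from it.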
The above and other objects and advantages of the present
invention shall be made apparent from the accompanying drawings and the description thereof.
Brief Description of the Drawing
Fig. 1 is a diagram of a network of computers including a
plurality of client computers and a server computer utilized in accordance with
principles of the present invention;
Fig. 2A is an illustration of a HUD-1 settlement statement
form;
Fig. 2B is an illustration of a sequence of form documents
generated by a legacy computer system on sequential pages, to be recognized and separated into individual files in accordance with principles of the present invention;
Fig. 3 is a process for identifying mutable and immutable text on a primarily graphically defined document and generating a data file representing the data on the form;
Fig. 4 is a process for identifying rectangles identifying sections
of a primarily graphically defined form and extracting data from rectangles on the form;
Fig. 4A is an illustration of horizontal correction of line intersections in the process of Fig. 4;
Fig. 4B is an illustration of vertical correction of line intersections in the process of Fig. 4;
Fig. 4C is an illustration of exemplary documents containing
rectangles and the rectangles recognized therefrom according to the process of
Fig. 4;
Fig. 5 is an illustration of a page matching process for
identifying a form from text thereon and extracting data from the form;
Fig. 6 is an illustration of an enhanced page matching process for identifying a form from text thereon and extracting data from the form.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and, together with a general description of the invention given above, and the
detailed description of the embodiments given below, serve to explain the principles of the invention.
Detailed Description of Specific Embodiments
Referring now to Fig. 1, a network of computers for carrying out the principles of the present invention can be described. At the heart of this
computer network is a server 10 which operates in conjunction with a mass
storage facility 11. As will be discussed in greater detail below, mass storage 11 includes received documents (which are typically filled-out forms and will be identified from time to time in the following as "forms"), template
documents, and data extracted from the forms using the templates. Server 10 interacts with a plurality of remotely located client computers via a network,
which in the illustrated embodiment is the Internet.
Principles of the present invention are applicable to a wide variety of potential applications involving the creation of documents and
extraction of data therefrom. For the purposes of illustration, the invention is
disclosed herein in the context of a particular application, which is the
transmission of mortgage and real estate purchase and refinancing forms between a lender, purchaser, broker, attorney and other parties involved in a
real estate transaction. Accordingly, the clients illustrated in Fig. 1 are those typically involved with the real estate purchase transaction, although
other applications and other types of client computers may be utilized consistent with principles of the present invention.
In this illustrated embodiment, real estate purchase/refinance and mortgage forms are initially generated by a lender bank's home office, at a
client computer 12. In a typical embodiment, the lender bank home office computer 12 may be a legacy computer system, such as a mainframe computer
system, which generates computerized forms in an unusual electronic format.
Legacy computer systems are often utilized for such forms for the reason that such forms must be compliant with standards established by the mortgage
lender and state or local regulation of mortgage and real estate purchase
transactions. For this reason, a legacy computer system typically has been extensively customized and revised to generate mortgage and real estate transaction forms and is not readily replaced with more modern computer systems.
It will be appreciated that real estate specific forms may also be
generated by other entities involved in the real estate transaction, such as by the lender's or buyer's attorney, the buyer's or seller's real estate broker or by
the seller or buyer. Furthermore, it will be appreciated that although such forms may be generated by legacy systems, they may also be generated by
modern computer systems such as those using Microsoft's Windows operating system and word processing software. The present invention is adaptable to
forms generated by any or all of the above identified sources and types.
Forms generated by client computers, such as the bank home office computer 12, may take a variety of forms. Forms managed by the system may be highly graphical forms 14, including graphical features such as
boxes and potentially icons or illustrations, providing instruction on the use of
the form and providing a more highly formatted appearance when the form is completed. An example of such a form is the HUD-1 form promulgated by the United States Department of Housing and Urban Development, which is used
as a standard closing summary form in real estate transactions in the United
States.
As seen in Fig. 2A, the HUD-1 form includes graphical boxes and highly formatted arrangements of information to present the purchaser, seller and lender details, and financing details on a real estate purchase
transaction. Some of these boxes, such as box 30 in the upper left of the form,
include only identifying information for the form, and do not include content
representing the data on the form. In other cases, such as box 32, the form
presents a series of check boxes and the data in the form is in the form of a
mark in one of those boxes. In further cases, such as box 34, a box includes an
identifying label such as "D. NAME AND ADDRESS OF BORROWER" and textual content that corresponds to that label.
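As a minimal sketch of how a box such as box 34 might be handled, the words found in a box can be compared against a known label, and whatever follows the label treated as the data for that label. The function name `split_label` and the sample name and address are hypothetical, and the sketch assumes the label words appear first in the box's word list.

```python
def split_label(box_words, label_words):
    """Return the words following the label if the box begins with it, else None."""
    n = len(label_words)
    if box_words[:n] == label_words:
        return box_words[n:]
    return None


# Hypothetical content of a labeled box; the name and address are invented.
label = "D. NAME AND ADDRESS OF BORROWER".split()
box = "D. NAME AND ADDRESS OF BORROWER Jane Doe 12 Elm St".split()
borrower = split_label(box, label)
```

A box whose words do not begin with the label returns `None`, so the same routine can be tried against each candidate label in turn.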
The position of data for the form is typically consistent with the labels, but there may be special cases. For example, there is no box that is explicitly labeled to include the seller's Social Security number, but this
information is often positioned in box 36 which bears the label "G. PROPERTY LOCATION". However, as represented by arrow 38, in some cases this information is positioned in box 38 which bears the label "E. NAME AND ADDRESS OF SELLER". Similarly, although box 40 is labeled "I.
SETTLEMENT DATE", this box may list both the settlement date and the
date on which funds are disbursed.
In some cases on the HUD-1 form, data is presented in a box that is adjacent to the box that labels that data. For example, the financial
terms of a transaction are presented in an area 42 of the HUD-1 form, and in
this area 42, data appears separate from labels. Thus, the box labeled "101.
Contract Sales Price" does not itself identify the contract sales price, but rather
the contract sales price is identified in a box immediately adjacent and to the right thereof. The same pattern repeats elsewhere in area 42 of the form.
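The adjacent-box pattern described above may be illustrated with the following sketch, which assumes that boxes have already been recognized and are available as rectangles with page coordinates in which y increases downward. The `Box` type, the tolerance `tol`, and the sample coordinates and amounts are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Box:
    left: float
    top: float
    right: float
    bottom: float
    text: str


def value_right_of(label_box, boxes, tol=2.0):
    """Return the box immediately to the right of label_box, or None.

    A candidate must start (within tol) where the label box ends and must
    overlap it vertically; the candidate with the most overlap is chosen.
    """
    def overlap(b):
        return min(b.bottom, label_box.bottom) - max(b.top, label_box.top)

    candidates = [
        b for b in boxes
        if b is not label_box
        and abs(b.left - label_box.right) <= tol
        and overlap(b) > 0
    ]
    return max(candidates, key=overlap) if candidates else None


# Hypothetical boxes; the positions and amounts are invented.
label_box = Box(0, 0, 100, 20, "101. Contract Sales Price")
boxes = [
    label_box,
    Box(100, 0, 160, 20, "175,000.00"),
    Box(100, 20, 160, 40, "500.00"),
]
price_box = value_right_of(label_box, boxes)
```

The box on the next row shares an edge with the label box but does not overlap it vertically, so it is not selected.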
Returning now to Fig. 1, highly graphical forms such as are
represented at 14, are not the only forms generated in typical transactions such
as real estate transactions. For example, forms may include lengthy text,
e.g., legal recitations, which must be acknowledged by a buyer's or seller's or
lender's signature, such as represented at 16. Forms of this type are less elaborately formatted than HUD-1 type forms 14, but such forms nevertheless
contain customized content, such as buyer, lender or seller names and addresses, property information, or financial information relating to the transaction. Furthermore, such forms typically include a blank for signature by
the buyer, seller or lender or any combination of these. A third type of form includes even fewer customized fields, such as a disclosure document for providing information regarding a transaction to, typically, the buyer of real estate. Such forms are represented at 18, and may include customized content relating to the transaction, such as the buyer's or seller's name or the purchase price,
but may not require a signature and may not typically include the highly
formalized structure of boxes and other graphical items typical of the HUD-1 form 14 and similar such forms.
Referring now to Fig. 2B, it will be appreciated that multiple
forms of these various types may be generated by a legacy computer system in
a print stream. In one embodiment, a PC desktop program is used to capture the print stream from a legacy computer system and package that print stream for uploading to server 10, as described in U.S. Patent Application 10/702,204
filed November 5, 2003 and assigned to the assignee hereof, which is hereby
incorporated herein by reference in its entirety.
It will be appreciated that the print stream from the legacy computer system may include several sequential pages 46a, 46b, 46c, which may collectively constitute a single form, or may represent multiple forms. In
the case where multiple forms are included in a single captured print stream, in accordance with principles of the present invention, it is desirable to parse
such a stream, recognize each of the forms 14, 16 and 18 in the stream, and extract the data from each form in the stream for subsequent processing. This
may be done by desktop software such as described in the above referenced
patent application, or at server 10, using processes described below.
Forms generated by an originating client computer may be delivered to the server 10 in a variety of potential formats. One example of a format that has traditionally been used to deliver forms for a transaction is the
Hewlett-Packard developed Printer Control Language or PCL. PCL includes
instructions for controlling a printer, typically a laser jet printer, to place text and graphical elements onto a page. Because Printer Control Language is
understood by a large number of printers from various manufacturers, it has
traditionally been used to transmit documents in electronic form from legacy
computer systems to other computers, in a manner that does not require the use of the legacy format on the receiving computers.
Although a Printer Control Language version of forms 14, 16
and 18 is typical, in accordance with principles of the present invention, other
formats may also be utilized for the delivery of documents from a client to a
server and other clients. Specifically, PostScript printer driver output may be transmitted from client computer 12 to server 10, and then utilized to deliver form documents to other client computers. PostScript is a page description language that differs from PCL in a number of ways, and is somewhat less
prevalent than PCL. However, PostScript has wide enough acceptance that PostScript drivers are available for most computer systems and most software packages used with those computer systems. In the event that such drivers are
not available, PCL may be converted to PostScript, using library utilities such as "JETPCL".
PostScript is an advantageous format for presenting graphics as compared to Printer Control Language. Furthermore, a derivative of
PostScript is utilized by the portable document format (PDF) popularized by Adobe Systems and used with the Adobe programs known as Acrobat and
Acrobat Reader. PostScript formatted forms, therefore, are more readily convertible to Adobe Acrobat format for delivery in PDF form, using library
utilities such as "PDFNET".
Beyond these well-known advantages of PostScript, in
accordance with principles of the present invention, it has been recognized that
PostScript may be more readily utilized to extract text from a document,
because PostScript formatted documents include a character encoding table
that readily permits parsing of the document to identify the graphics therein
that represent text characters. Although Printer Control Language also includes a character encoding table, character encoding in PCL is typically
more difficult to digest and process because the PCL standard permits omission of encoding tables in the document, thus causing each document to have characters that must be uniquely decoded without the benefit of an
encoding table, to determine which characters are presented at each point in the
document. Because PostScript character encoding may be more easily managed, a PostScript document may be more readily scanned to determine all of the characters utilized in that document.
Specifically, to recognize text in a PostScript formatted
document, the embedded character encoding in the document is identified and
utilized to compare graphical elements in the document to the character encoding, to determine which characters are being presented at each location in the document, after which the words and sentences presented on each page
may be extracted. Such a scanning function is known to those of skill in the
art, and can be accomplished, e.g., by library utilities such as "XPDFTEXT", a
library utility that returns an extracted word from a PDF formatted document each time the utility is invoked.
Thus, in accordance with principles of the present invention,
the text on a form may be extracted and converted to a word list, with each
word having a page location, permitting processing of those words to determine words that are customized form data and also permitting those words which are not customized form data to be matched against templates to
determine which form is being utilized.
Returning to Fig. 1, when server 10 receives form content such as a form 14, 16 or 18, server 10 stores those received forms in mass storage
11. Forms may be received in PostScript format, or the reception by server 10 may involve conversion of forms received in PCL format to PostScript. Then,
server 10 processes those forms utilizing a program in server 10 and/or template data in mass storage 11, as described in the various embodiments reviewed below. The outcome of this processing is an identification of the
data contained in the forms - such as the lender's and borrower's names and
addresses, financial details of the transaction, and the like. This data is stored in mass storage device 11 in a retrievable data format such as an extensible
markup language (XML) format, or more particularly the Mortgage Industry
Standards Maintenance Organization (MISMO) SmartDoc format.
After forms have been processed by server 10, those forms are
available for retrieval by other clients and other client computers. For
example, an attorney involved in the transaction working at an attorney computer 22 may desire to download an XML / MISMO SmartDoc version of
a HUD-1 financing statement, to compare the content of the financing
statement to the transaction information that the attorney has generated.
Simultaneously, or alternatively, an attorney may wish to download the HUD-1 form 14 itself for printing at the attorney's computer 22. Accordingly, the attorney may download a version 14' of the HUD-1 form that has been
converted by server 10, such as a version converted from PCL to PostScript format, and then to portable document format or PDF.
Similarly, a real estate broker utilizing a broker's computer 24
may desire to retrieve XML or MISMO SmartDoc formatted documents such as 20 including financial details of the transaction or name and address
information, as well as reformatted HUD-1 form 14' or other forms uploaded
to server 10. Finally, a buyer utilizing a buyer's computer 26 may receive some or all of the forms generated by the bank home office 12, in this case including not only the HUD-1 form 14 but also a mortgage paper or
promissory note 16 and disclosure document 18, all of which may be
reformatted for use with electronic signature technology as shown at 16', 18'
and 20', and discussed further in U.S. Patent Application Serial No. 11/076,665, filed March 10, 2005, which is hereby incorporated herein by reference.
Referring now to Fig. 3, the overall process for form
recognition in accordance with the principles of the present invention is
illustrated. A first step 100 in this process is to generate a word list for the
document. Generation of a word list may take a number of forms depending
upon the format in which the document is made available. For example, a PostScript formatted document may be decoded in step 102 by utilizing the
PostScript character table and the graphics presented therein to recognize
character graphics within the document and convert those character graphics to text letters. Strings of text letters can then be converted into words, and the
resulting words and locations thereof used for subsequent processing.
Alternatively, the document may be a raster graphic image such as a facsimile
file, in which case the file will not contain a character table that may be used to recognize characters from graphics. In this situation, character recognition techniques 104 may be applied to the document to identify characters, and then
words and their location, for subsequent processing. It will be appreciated that
documents in other formats may also be processed in accordance with
principles of the present invention. For example, some documents in a PCL
format may include a character table, that may be decoded to identify words and characters therein, although the methodology utilized may differ from the
methodology used for PostScript formatted documents as discussed above.
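The conversion of recognized characters into words with page locations may be sketched as follows. This is an illustrative simplification only: it assumes each recognized character is available with an x position, a baseline y position and a width, that characters on one line share exactly the same y value, and that a horizontal gap wider than the hypothetical parameter `gap` separates words. The `Glyph` and `Word` types are not part of the specification.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float
    y: float
    width: float


@dataclass
class Word:
    text: str
    x: float  # position of the first character of the word
    y: float


def glyphs_to_words(glyphs, gap=3.0):
    """Group recognized characters into words, keeping each word's location."""
    words = []
    current = []

    def flush():
        if current:
            text = "".join(g.char for g in current)
            words.append(Word(text, current[0].x, current[0].y))
            current.clear()

    # Read top to bottom, then left to right.
    for g in sorted(glyphs, key=lambda g: (g.y, g.x)):
        if current:
            prev = current[-1]
            new_line = g.y != prev.y
            wide_gap = g.x - (prev.x + prev.width) > gap
            if new_line or wide_gap or g.char.isspace():
                flush()
        if not g.char.isspace():
            current.append(g)
    flush()
    return words


# Hypothetical glyphs spelling "Hi yo" on one line, each 5 units wide.
glyphs = [
    Glyph("H", 0, 0, 5), Glyph("i", 5, 0, 5),
    Glyph("y", 15, 0, 5), Glyph("o", 20, 0, 5),
]
word_list = glyphs_to_words(glyphs)
```

Each resulting word carries the location of its first character, which is the information later steps rely on when relating mutable text to immutable text and graphic elements.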
Following generation of a word list for the document, in step
106, the form recognition algorithm identifies the immutable text and graphic content in the form. This step may take a variety of possible forms depending
upon the specific form at issue and the recognition algorithm utilized. In some algorithms discussed below with reference to Figs. 5 and 6, the process of
locating immutable text in the form, includes recognition of the form from among a plurality of candidates. However, in some embodiments of the present invention, the form in use is known in advance, and processing may
therefore be specific to the previously known form. This latter approach is a typical use of the rectangle recognition algorithm of Fig. 4. Accordingly, the process illustrated in Fig. 3 includes an optional step 108 of recognizing the form based upon its immutable content, which is performed only in those
cases where the form is not known in advance.
Following identification of the immutable text of the specific
form in use, in step 110, the mutable content of the form, i.e., the information filled in the blanks of the form, is identified, and its meaning is determined based upon its position relative to the immutable content. Various embodiments of the invention perform this step in different ways. For
example, the mutable text may be recognized by its proximity and positioning
relative to the immutable text that has been recognized in step 106.
Alternatively, or in addition, mutable text may be recognized from its position
relative to graphic content recognized in step 106. After identification of
mutable content, its meaning may be determined based upon its positioning in the form, or any other relationship it has with immutable content. In one
example, the meaning of the mutable content may be determined from the text
that it immediately follows, for example, the name and address of a borrower
may follow the immutable text "D. NAME AND ADDRESS OF
BORROWER" as seen in box 34 of Fig. 2A. Alternatively the position of mutable text relative to graphic elements may identify its meaning. For example, in area 42 illustrated in Fig. 2A, the meaning of mutable text such as
"175,000.000" is determined by the positioning of the box containing it relative to other boxes.
After the meaning of mutable text has been determined in step 112, that mutable text is associated with variable names reflecting the meaning
of the mutable text. Then in step 114, a data file, such as an XML, MISMO
SmartDoc, or other data file is stored containing the mutable text and the
variable names identifying the meanings of the mutable text, for later use as discussed above with reference to Fig. 1.
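The storing of mutable text in association with variable names may be illustrated with the following sketch, which writes a simple XML document using the Python standard library. The element names `Form` and `Field`, the variable names, and the sample values are hypothetical; they are not the MISMO SmartDoc schema, which defines its own structure.

```python
import xml.etree.ElementTree as ET


def build_data_file(form_name, fields):
    """Serialize extracted fields as XML text.

    fields maps a variable name (the identified meaning) to the mutable
    text found on the form. The element names are illustrative only.
    """
    root = ET.Element("Form", name=form_name)
    for variable, value in fields.items():
        ET.SubElement(root, "Field", name=variable).text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical extracted data.
xml_text = build_data_file(
    "HUD-1",
    {"BorrowerName": "Jane Doe", "ContractSalesPrice": "175,000.00"},
)
```

A client retrieving the data file can then look up a value by its variable name without needing to parse the graphical form itself.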
Referring now to Fig. 4, one embodiment of the present
invention recognizes immutable content of the form by identifying rectangles
therein. In this embodiment, a rectangle recognition algorithm 120 is utilized to divide the surface of the form into rectangles, which are then used for recognition of text and the meaning of that text on the form. Prior to recognizing rectangles, the algorithm of Fig. 4 corrects the positioning of
vertical and horizontal graphic lines on the form to improve the identification
of rectangles on the form.
The algorithm described herein utilizes the midline of horizontal and vertical lines to graphically define the position of those lines.
Thus, as seen in Fig. 4A, the algorithm will use the midline 124 of a horizontal line and the midline 126 of a vertical line. It will be appreciated, however, that
graphically defined lines include a finite width as seen at 125 and 127. The finite width of the graphically defined line adds a potential ambiguity and
source of error, specifically with respect to the exact position of the lines relative to each other.
As seen in Fig. 4A, midline 124 is defined to overlap with line 126, within the width 127 of the printed appearance of line 126. Lines 124
and 126 will thus have the printed appearance of a T-intersection; however,
midline 124 does not end exactly at midline 126, and so does not form a true T-intersection with it. The overlap of midline 124 with midline 126 will not be visible on a printed
document because the graphical widths of line 124 and line 126,
as illustrated at 125 and 127, cover the overlap, so that the intersection between the two lines will appear to be a proper T-intersection
even though there is overlap.
A similar situation arises when the midline of a graphically
defined line does not reach another line at a location that appears to be a
T-intersection. Specifically, as seen in Fig. 4A, midline 124' does not
intersect midline 126' even though the printed width of line 124', as seen at 125', will overlap with the printed width of line 126', as seen at 127'.
A similar adjustment to that described above also needs to be
performed for vertical lines. Referring to Fig. 4B, it can be seen that a vertical line 132 having a printed width shown at 133 may not properly intersect with a horizontal line 134 having a printed width 135. Specifically, line 132 may extend beyond line 134 rather than forming a proper T-intersection, and again,
this inaccuracy may exist even though it is not visible on the printed page for the reason that the printed widths 135 and 133 of the respective lines mask the
excess overlap of line 132 with line 134. Similarly, as seen in Fig. 4B, line
132' does not properly intersect with line 134' shown in Fig. 4B, for the reason that line 132' stops short of line 134'. The improper intersection between lines
134' and line 132' may not be visible because the printed widths 133' and 135' mask it.
The lack of overlap or the excess of overlap, as illustrated in
Fig. 4A, must be corrected prior to identifying rectangles on a form, to
improve the accuracy of that recognition. Accordingly, returning to Fig. 4, in
step 122 a horizontal correction is performed at each line intersection.
In the process of step 122, each vertical line is extended by a
predetermined amount e/2 at each end. Then, each horizontal line is analyzed
to determine whether its vertical position is within the vertical range of the
extended vertical line. If so, then if an end of the horizontal line is within e/2 of the horizontal position of the vertical line, then the position of that end of the horizontal line is set equal to the horizontal position of the vertical line.
Thus, if a horizontal line has an endpoint close to the horizontal position of a
vertical line, i.e., close to forming a T-intersection with a vertical line, but not exactly, the end point of the horizontal line is changed to equal the horizontal position of the vertical line, so that the end point of the horizontal line exactly
corresponds to the horizontal position of the vertical line. This involves
horizontally extending or shortening the end point of the horizontal line to
match the horizontal location of the vertical line. This compensation adjusts either of the inaccurate intersections illustrated in Fig. 4A to an accurate intersection.
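The horizontal correction of step 122 may be sketched as follows. This is a minimal sketch under stated assumptions: the classes VLine and HLine, the function name snap_horizontal_ends, and the convention that vertical coordinates increase down the page are all hypothetical and are introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class VLine:
    x: float       # horizontal position of the vertical midline
    top: float     # vertical position of the upper end
    bottom: float  # vertical position of the lower end (assumed > top)


@dataclass
class HLine:
    y: float      # vertical position of the horizontal midline
    left: float   # horizontal position of the left end
    right: float  # horizontal position of the right end


def snap_horizontal_ends(hlines, vlines, e):
    """Move horizontal line ends that lie within e/2 of a vertical line onto it."""
    half = e / 2
    for v in vlines:
        # The vertical line is treated as extended by e/2 at each end.
        lo, hi = v.top - half, v.bottom + half
        for h in hlines:
            if not (lo <= h.y <= hi):
                continue
            if abs(h.left - v.x) <= half:
                h.left = v.x
            if abs(h.right - v.x) <= half:
                h.right = v.x
    return hlines
```

An end that overshoots the vertical line and an end that stops short of it are handled identically, since both are simply set to the horizontal position of the vertical line.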
The rectangle recognition algorithm of Fig. 4 also includes step
130 for correcting vertical lines. Specifically, each vertical line is evaluated to determine whether it is close to forming a T-intersection with a horizontal line
at either of its ends. If so, the vertical line is extended or shortened so that the
vertical position of the end of the vertical line matches the vertical position of
the adjacent horizontal line.
In the process of step 130, each horizontal line is extended by a
predetermined amount e/2 at each end. Then, each vertical line is analyzed to
determine whether its horizontal position is within the horizontal range of the
extended horizontal line. If so, then if the top of the vertical line is within e/2
of the vertical position of the horizontal line, then the top of the vertical line is set equal to the vertical position of the horizontal line, and, if the bottom of the
vertical line is within e/2 of the horizontal line, then the bottom of the vertical line is set equal to the vertical position of the horizontal line. Thus, if a vertical line has an endpoint close to the vertical position of a horizontal line,
i.e., close to forming a T-intersection with a horizontal line, but not exactly, the end point of the vertical line is changed to equal the vertical position of the horizontal line, so that the end point of the vertical line exactly corresponds to
the vertical position of the horizontal line. This involves vertically extending or shortening the end point of the vertical line to match the vertical location of
the horizontal line. This compensation adjusts either of the inaccurate intersections illustrated in Fig. 4B to an accurate intersection.
Following the adjustment steps described above, the algorithm
of Fig. 4 proceeds to break the form into rectangles to permit subsequent
recognition of text and its meaning, based upon the particular rectangle regions of the form. This process scans in a raster fashion through the form. The initial vertical position at the beginning of the process is set in step 142 to be
the top margin of the form. Next, in step 144 a data structure is created to
represent an "open" rectangle. Open rectangles are defined by a data structure
identifying the left and right edges of the rectangle, and the vertical position at which the rectangle starts. The left, right, top and bottom printable area
margins (hereafter referenced as just the "margins") are treated as rectangle boundaries so that each page will contain at least one rectangle. Boxes within the page will create further rectangles, as discussed below.
Thus, to initiate the process in step 144, an open rectangle data
structure is created, describing an open rectangle extending from the left to the
right margin of the form, and starting at the current vertical position. Next, in
step 145 it is determined whether any non-checkbox vertical lines end (i.e., have their bottom end) at the current vertical position on the form. "Non-checkbox" vertical lines are vertical lines that are not part of a checkbox, i.e.,
any of the vertical lines in Fig. 2A except those that define the checkboxes in
box 32. (Checkboxes are rectangles or squares that have a predetermined maximum size, and are identified in this manner, and ignored during rectangle recognition.) If a non-checkbox vertical line ends at the current vertical
position, this indicates that currently open rectangles must be merged to reflect
the absence of the vertical line. Therefore, if there is a vertical line ending at
the current vertical position, then in step 146, the open rectangles that join at
the horizontal position of the ending vertical line are closed, by storing in the data structure representing the rectangles, a closing vertical position that
equals the current vertical position. Next, one rectangle is opened, by forming
a data structure representing the rectangle, in step 147. The rectangle opened
in step 147 has its left edge at the left edge of the closed rectangle to the left of the ending vertical line, and its right edge at the right edge of the closed rectangle to the right of the ending vertical line. Thus, the ending of a vertical line will merge open rectangles by replacing them with one open rectangle beginning at
the vertical position of the end of the vertical line.
After thus creating a new rectangle at the ending position of the vertical line, processing returns to step 145 at which it is determined whether there is another vertical line ending at the current vertical position. If so, then
additional processing according to steps 146 and 147 is performed to merge
additional rectangles.
After all ending vertical lines have been handled by steps 145-147, the next process is to identify in step 148, any non-checkbox vertical lines that start (i.e., have their top end) at the current vertical position on the form.
If a non-checkbox vertical line starts at the current vertical position, this indicates that currently open rectangles must be divided to reflect the presence of the vertical line. Therefore, if there is a vertical line starting at the current
vertical position, then in step 149, the open rectangle that includes the
horizontal position of the starting vertical line is closed, by storing in the data
structure representing the rectangle, a closing vertical position that equals the
current vertical position. Next, two rectangles are opened, by forming data
structures representing those rectangles, in step 150. The two rectangles
opened in step 150 have their right and left edges, respectively, at the horizontal position of the vertical line, and have their left and right edges, respectively, at the left and right edges of the open rectangle that was closed in
step 149. Thus, the presence of a vertical line will split an open rectangle by
replacing that open rectangle with two open rectangles beginning at the vertical position of the start of the vertical line.
After thus creating new rectangles at the position of the starting vertical line, processing returns to step 148 at which it is determined whether
there is another vertical line starting at the current vertical position. If so, then additional processing according to steps 149 and 150 is performed to create
additional rectangles, which may further divide open rectangles into a greater number of rectangles.
When all vertical lines ending and starting at the current vertical position have been processed through steps 145 and 148, then in step
152 the current vertical position is moved down to the next vertical coordinate.
If the bottom margin (end of page or EOP) has not yet been reached, then processing continues to step 154 in which it is determined whether there is a
non-checkbox horizontal line at the current vertical position. If there is a non-
checkbox horizontal line at the current vertical position, then in step 156, all
currently opened rectangles are closed, by marking the data structures
representing those rectangles as terminating at the current vertical position.
Processing then continues to step 144 in which a new open rectangle is created
at the current vertical position, extending from the left to right margin.
Thereafter, processing continues to step 146 discussed above, in which any vertical lines at the current vertical position are assessed to potentially break the rectangle created in step 144 into smaller rectangles reflecting the presence
of vertical lines.
Returning to step 154, if it is determined that there is no horizontal line at the current vertical position, then processing proceeds
directly from step 154 to step 146 to evaluate whether there are vertical lines at the current vertical position that will require division of currently opened rectangles into smaller rectangles.
The algorithm described in Fig. 4 proceeds through the form from one vertical position to another until the bottom margin of the form is
reached in step 152. At this point, there are no additional vertical positions to evaluate in the form, and the process of Fig. 4 proceeds to step 160.
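The sweep described above may be sketched as follows. This is a simplified sketch rather than the flow chart of Fig. 4 itself, and it rests on several assumptions introduced here: line ends have already been corrected, checkbox lines have already been removed, the top and left margins are at coordinate 0, vertical coordinates increase down the page, the sweep visits only those vertical positions at which a line starts, ends or lies (rather than every coordinate), and zero-height rectangles produced by closing a rectangle at the position where it was opened are discarded. The function name find_rectangles and the tuple layouts are hypothetical.

```python
def find_rectangles(width, height, hline_ys, vlines):
    """Divide a page into rectangles.

    hline_ys: vertical positions of horizontal lines.
    vlines:   (x, top, bottom) tuples for vertical lines.
    Returns closed rectangles as (left, top, right, bottom) tuples.
    """
    closed = []
    open_rects = [(0, width, 0)]  # (left, right, top), as opened in step 144

    def close(rect, y):
        left, right, top = rect
        if y > top:  # discard zero-height rectangles
            closed.append((left, top, right, y))

    ys = set(hline_ys)
    for _, top, bottom in vlines:
        ys.update((top, bottom))

    for y in sorted(v for v in ys if 0 <= v < height):
        # Steps 154/156/144: a horizontal line closes every open rectangle
        # and a margin-to-margin rectangle is opened.
        if y in hline_ys:
            for rect in open_rects:
                close(rect, y)
            open_rects = [(0, width, y)]
        # Steps 145-147: a vertical line ending here merges its two neighbours.
        for x, _, bottom in vlines:
            if bottom != y:
                continue
            lefts = [r for r in open_rects if r[1] == x]
            rights = [r for r in open_rects if r[0] == x]
            if lefts and rights:
                a, b = lefts[0], rights[0]
                close(a, y)
                close(b, y)
                open_rects.remove(a)
                open_rects.remove(b)
                open_rects.append((a[0], b[1], y))
        # Steps 148-150: a vertical line starting here splits one rectangle.
        for x, top, _ in vlines:
            if top != y:
                continue
            hit = [r for r in open_rects if r[0] < x < r[1]]
            if hit:
                r = hit[0]
                close(r, y)
                open_rects.remove(r)
                open_rects += [(r[0], x, y), (x, r[1], y)]

    # Bottom margin reached: close whatever is still open.
    for rect in open_rects:
        close(rect, height)
    return closed
```

For a page containing a single centered box, this sketch yields one rectangle above the box, three across the band occupied by the box, and one below it.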
Fig. 4C provides exemplary illustrations of the manner in
which the process of Fig. 4 will divide various forms including boxes, into
rectangles. Form 158 illustrated in Fig. 4C includes a single box generally
centered on the form, and thus is a simple example. This form 158 will be divided by the process of Fig. 4 into five rectangles. A first rectangle 158-1 is
defined in the area extending from the left, right and top margins to the horizontal line that defines the top of the box in the form 158. Three
rectangles 158-2, 158-3 and 158-4 are defined in the horizontal region where the box is positioned on the form, one to the left of, one to the right of, and one corresponding to the box. Finally, a fifth rectangle 158-5 is defined for the area extending from the horizontal line defining the bottom of the box, to
the left, right and bottom margins.
Form 159 is a more elaborate form that includes 14 adjacent boxes of irregular sizes. This form will be divided by the process of Fig. 4
into 19 rectangles. Notably, as with the simpler form 158, a rectangle will be
defined at the top and bottom of the form, extending from the left to right margins, and rectangles will be defined to the left and right of the boxes,
divided by the horizontal lines. The boxes in the original form correspond to
the 14 rectangles 159-3, 159-4, 159-7, 159-8, 159-9, 159-13, 159-14, 159-15, 159-18, 159-19, 159-20, 159-23, 159-24 and 159-25.
Step 160 and the following steps recognize and associate text in the form with fields, on a rectangle by rectangle basis, thus capturing the data
presented by the form. Specifically, in step 162 a rectangle of the form is
evaluated to identify the meaning of the content in the rectangle, by evaluating
whether there is immutable content contained in the rectangle, i.e., text or a
graphic that is a known field identifier. For example, the rectangle may
include the known field name "D. NAME AND ADDRESS OF BORROWER:". If the rectangle is thereby identified, the rectangle may also contain mutable text which identifies the values for a field.
As noted above, the relationship between values and the
immutable text in a rectangle may not always be simple - a social security
number may be included in a rectangle having the immutable text "G. PROPERTY LOCATION" and a disbursement date may be included in a
rectangle having the immutable text "I. SETTLEMENT DATE". The process of step 160 and the following steps must be designed to flexibly determine the mutable text once the immutable text has been identified, for example by
recognizing a social security number as distinct from a name or address based upon its sequencing of digits and dashes.
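Recognition of mutable text by its format may be sketched with simple pattern matching. The patterns below (three digits, two digits and four digits separated by dashes for a social security number, and a slash-separated numeric date) and the function name classify_mutable are assumptions chosen for illustration; an actual embodiment may use different or additional patterns.

```python
import re

# Illustrative patterns only; real forms may use other layouts.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}


def classify_mutable(text):
    """Return the specially formatted values found in a rectangle's mutable text."""
    found = {}
    for kind, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            found[kind] = match.group(0)
    return found
```

Text that matches none of the patterns would then be attributed to the field named by the immutable text of the rectangle.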
In the case where there is mutable text in a rectangle,
processing continues to step 164 and then to step 166 in which any text within
the rectangle is extracted and assigned to the appropriate variable or variables,
as identified by the immutable text or graphic in the rectangle identified in step 162, and if necessary by the nature of the mutable text (e.g., formatted as a social security number) or by the positioning of mutable text in other
rectangles.
In some cases, a rectangle may not contain mutable text, but may include a reference to an adjacent rectangle containing mutable text. For
example, the rectangle may contain the immutable text "101. Contract Sales Price", and be adjacent to another rectangle which contains the dollar figure
for the contract sales price. In this case processing continues through step 168 to step 170 in which the text from the appropriate adjacent rectangle, selected based upon the immutable text, is extracted and assigned to the appropriate variable, e.g., the variable that is identified by the immutable text in the current
rectangle.
Some rectangles may contain checkboxes, and the processing of such rectangles in steps 164 and 166 is slightly different. In this case the
mutable content is analyzed to determine whether there is mutable text or a graphic
(e.g., an "x" or check symbol) positioned inside one or more of the
checkboxes, and if so, then the immutable text adjacent to that checkbox is used to identify the meaning of the selected checkbox(es).
Some rectangles may not contain any text or may not contain mutable text which is associated with a value. For example, the rectangle may
be between boxes of the form and the margins. Or the rectangle may be part
of a box of the form, which only has the title of the form, e.g., "A. U.S.
DEPARTMENT OF HOUSING & URBAN DEVELOPMENT
SETTLEMENT STATEMENT". In one embodiment, such content may be used to recognize the form, but beyond this use, such rectangles may be
ignored in the processing of steps 162 to 170 because they do not contain mutable text.
The processing of steps 162 through 170 is performed for each
rectangle identified on the form via the process of Fig. 4, after which all of the mutable text on the form that is associated with variable names, will be captured and stored for later delivery to clients, as discussed above with reference to Fig. 3.
The process described above with reference to Fig. 4 analyzes a
form by identifying rectangles on the form and thereafter identifying
immutable elements and mutable text in those rectangles, using the immutable text as a guide to the location and meaning of the mutable text. While this is
an efficient process for forms that include rectangles, an alternative process for
identifying mutable and immutable content of a form is required for form
documents such as those identified at 16 and 18 in Fig. 1, which do not include boxes that can be used for recognition.
One such alternative process, in accordance with principles of the present invention, is described in Fig. 5. The process of Fig. 5 extracts a
word list from an unknown received document, and compares it to the word
list from known forms, and uses the sequence of matching (immutable) and
nonmatching (mutable) words to determine which of the known forms is the
closest match, and identify the meaning of the mutable and immutable text and data on the form.
In the process of Fig. 5, in a first step 200 a word list, to be identified as "TEST", is extracted from the document to be matched. In the
case of a PostScript or PDF formatted source document, this step typically involves using a library function that uses the character map in the document
to recognize graphical patterns that correspond to individual characters, and then recognize sequences of such characters as words at particular positions on the document. The words on the document are then converted to a word list
by a raster scan of the document, left to right and top to bottom.
It will be appreciated that principles of the present invention may be applied to word lists generated in any manner, and to documents in multiple formats. For example, ASCII or other standard word processing
format documents may be converted to a word list without requiring recognition of characters. Fax-formatted or other graphics-formatted
documents may also be converted to word lists by optical character recognition (OCR) techniques and OCR library utilities known to those of skill in the art.
It will be appreciated, as discussed above with reference to Fig. 2B, that incoming documents may be multi-page, and those multiple pages
may constitute a single form or may include multiple forms in a single stream, requiring separation and individual matching of pages to known forms. In
such circumstances, the wordlist (i.e., "TEST") extracted from the incoming document may be initially generated from only the first page of a multi-page print stream or document. If a word list extracted only from the first page
does not adequately match any known forms, then a word list extracted from the first two pages may be compared to known forms, and so on. Once a set of pages in an incoming document is successfully matched to a known form, then the remaining pages may be processed by the same procedure, starting with a
single page, then multiple pages.
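The page-by-page strategy just described may be sketched as follows. The function name split_stream and the callable match_form, which is assumed to return an identifier for a matched form or None when no known form matches adequately, are hypothetical and stand in for whichever matching process is used.

```python
def split_stream(pages, match_form):
    """Group the pages of a stream into matched forms.

    pages:      list of per-page word lists.
    match_form: callable taking a word list, returning a form id or None.
    Returns (form_id, first_page, end_page) tuples, end_page exclusive.
    """
    results = []
    start = 0
    while start < len(pages):
        matched = None
        # Try one page, then two pages, and so on.
        for end in range(start + 1, len(pages) + 1):
            words = [w for page in pages[start:end] for w in page]
            form = match_form(words)
            if form is not None:
                matched = (form, start, end)
                break
        if matched is None:
            # No known form matches the remaining pages.
            results.append((None, start, len(pages)))
            break
        results.append(matched)
        start = matched[2]
    return results
```

Once a group of pages is matched, the sketch restarts with a single page at the first unmatched page, mirroring the procedure described above.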
Following extraction of a wordlist from an incoming document,
or a page or pages thereof, this wordlist is compared to wordlists from each of several candidate forms that may be matchable to the incoming document.
These wordlists are generated in advance, for example by causing the source
client system to output a "blank" form, i.e., a form that contains all of the immutable content of a form, but does not contain any mutable content, and then converting the resulting "blank" form to a wordlist. Known wordlists
may also be generated manually by reviewing each form used by an organization and producing an optimal wordlist for matching to all known
versions of a form. Finally, known wordlists may be generated by causing a
client system to output wordlists for multiple versions of a given form (i.e.,
using different mutable content typical of different transaction types that use
the form), and then identifying the largest common subsequence of those
wordlists by repeated application of a process such as that discussed below with reference to Fig. 6. This last approach to generating known wordlists
offers the potential for periodic automatic updating of known wordlists in
response to changes in forms at the client system. Specifically, the server 10 may retain several recent versions of a recognized form, e.g., as those forms
are forwarded through server 10, and periodically use the retained recent versions to rebuild a known wordlist representing the largest common subsequence of words in those forms, which can replace the known wordlist currently in use if any changes are noted.
The comparison of the incoming, unknown wordlist TEST to
known wordlists is performed one wordlist at a time. Specifically, in step 202,
a known wordlist "KNOWN" that has not previously been evaluated, is selected from the available pool of known wordlists. Next, in step 204, the
pointer variables x and y are initialized to zero and a temporary file is created for storing a matched version of the incoming wordlist TEST. Then, a loop of
steps 206-214 is entered, which collectively compare words in TEST to words
in KNOWN. In step 206, a current word in TEST (at position x) is compared to a current word in KNOWN (at position y). If these two current words
match, then in step 208 the pointers x and y are incremented (skipping to the
next words in TEST and KNOWN) and the current word in TEST is stored in the temporary file as a matched, apparently immutable word. If, however, the
current words do not match, then in step 210 the pointer x is incremented (skipping to the next word in TEST), and the current word in TEST is stored in the temporary file as an unmatched, apparently mutable word. Thereafter, in steps 212 and 214 it is evaluated whether the end of the TEST or KNOWN wordlists have been reached, respectively, and if not, processing returns to step
206 to check for a match between the then-current words in TEST and
KNOWN.
When the end of either the TEST or KNOWN wordlists is reached, processing will proceed from either step 212 or 214, respectively, to step 216. At this step, the pointer y will identify the number of words in
KNOWN that were successfully matched to TEST. If the number of words y
matched to the KNOWN under evaluation is the largest number achieved so far, then in step 218 the temporary file storing the matched version of TEST, is stored as a current best match. In any case, in step 220, it is determined
whether further known wordlists are available for evaluation. If so, then in
step 222 the temporary file is erased for the next iteration, and processing
returns to step 202 to select another candidate known wordlist and evaluate its
match to the unknown wordlist TEST.
After all candidate known wordlists have been evaluated, processing continues from step 220 to step 224, in which the best match file
generated during the previous efforts is evaluated and compared to criteria that
establish when a "match" is considered accomplished. These criteria may, for
example, require that the matched wordlist include a limited number of unmatched / apparently mutable words, or include matching for all or
substantially all of the words in the known wordlist. If these criteria indicate a success, then processing continues to step 226, and the mutable text is extracted from the best match file, and its meaning recognized based upon its proximity to immutable text, in a manner analogous to the processing
discussed above with reference to Fig. 4, steps 162-170.
In the event that the best matching known wordlist does not
meet the criteria of step 224, a matching failure is returned in step 228. This may end the matching process, or as discussed above may cause the matching
process to restart, using a TEST wordlist created from a series of the pages of
a received stream beyond those used in the initial evaluation.
An example demonstrating the results of the process of Fig. 5 follows. In this example, the wordlist to be tested includes the following:
TEST = Name: Tim Ext.: x123 Title: Programmer Spvsr.: Phil
Three known wordlists are matched to this test word list, as follows:
KNOWN #1 = Name: Ext.: Fax: Home: Title:
KNOWN #2 = Name: Ext.: Fax: Home: Title: Spvsr.:
KNOWN #3 = Name: Ext.: Title:
Matching proceeds according to Fig. 5, resulting in the following temporary files for each known word list (the words matched from the known word list are listed after each match count):
Result #1 = Name: Tim Ext.: x123 Title: Programmer Spvsr.: Phil
Matches = 2 (Name:, Ext.:)
Result #2 = Name: Tim Ext.: x123 Title: Programmer Spvsr.: Phil
Matches = 2 (Name:, Ext.:)
Result #3 = Name: Tim Ext.: x123 Title: Programmer Spvsr.: Phil
Matches = 3 (Name:, Ext.:, Title:)
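The loop of steps 206-214 may be sketched as follows, applied to the word lists of this example. The function name sequential_match and the use of an in-memory list of (word, matched) pairs in place of the temporary file are assumptions made for this sketch.

```python
def sequential_match(test, known):
    """Single-pass matching in the manner of Fig. 5, steps 206-214.

    Returns y, the number of KNOWN words matched, and the TEST words
    examined so far, each tagged True (matched) or False (unmatched).
    """
    x = y = 0
    tagged = []
    while x < len(test) and y < len(known):
        if test[x] == known[y]:
            tagged.append((test[x], True))   # apparently immutable word
            y += 1
        else:
            tagged.append((test[x], False))  # apparently mutable word
        x += 1
    return y, tagged


TEST = "Name: Tim Ext.: x123 Title: Programmer Spvsr.: Phil".split()
KNOWN_1 = "Name: Ext.: Fax: Home: Title:".split()
KNOWN_2 = "Name: Ext.: Fax: Home: Title: Spvsr.:".split()
KNOWN_3 = "Name: Ext.: Title:".split()

counts = [sequential_match(TEST, k)[0] for k in (KNOWN_1, KNOWN_2, KNOWN_3)]
```

Because the pointer y advances only on a match, a word of KNOWN that is absent from TEST (such as "Fax:") blocks every later word of KNOWN from being matched.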
It may be observed from this example that the matching process of Fig. 5 successfully identifies that the immutable word "Name:" in each known wordlist is matchable to the test word list, and furthermore that this
immutable word is followed in the test word list by a mutable word "Tim"
(which would be implied to be a name). Furthermore, after the word "Tim",
each known word list includes the immutable word "Ext.", which is matched by the process of Fig. 5 to the word "Ext.:" in the TEST word list. The TEST
word list includes the mutable word "x123" after "Ext." (which would be implied to be a telephone extension). Furthermore, it can be seen that the third
known word list includes the immutable word "Title", which is also matched to
the test word list. After "Title" the TEST word list includes "Programmer
Spvsr. Phil", which would be implied to be a title. Thus, the matching
algorithm is capable of identifying mutable and immutable text and permitting
extraction of mutable text and identification of the meaning of that mutable text from proximity to immutable text.
However, it may also be observed that the process of Fig. 5 fails to identify the best match of the three known word lists to the TEST word list. Specifically, the best match of the three known word lists is KNOWN #2,
which not only includes immutable text for "Name:", "Ext." and "Title" (as is
the case with KNOWN #3), but also includes immutable text for "Spvsr.".
The process of Fig. 5 fails to match "Spvsr." in the TEST word list to KNOWN #2, for the reason that KNOWN #2 includes an immutable word
"Fax:" which is not included in the TEST wordlist. The omission of the
immutable word "Fax:" from the TEST wordlist prevents matching of the test
wordlist to KNOWN #2 beyond the immutable word "Ext." when using the process of Fig. 5, and thus prevents the best matching known wordlist from being identified.
This sort of difficulty, created by word deletions, is not
necessarily unusual in practical examples. While documents created from
forms typically include all of the immutable content of the "blank" form, plus additional mutable content, it is typical for forms to be updated from time to
time, and this updating may involve deleting immutable content from the
form. E.g., a field may be deleted or the immutable words in that field may be
simplified by deletion of words. Furthermore, in some circumstances a "blank" form, rather than being blank, may include placeholder words at the
location of mutable content, to provide for easier identification of the locations where mutable content is inserted. The "blank" form could for example
include placeholder words (e.g., unique gibberish words such as zzyzz, zyzzz, etc.) at each location of mutable text. (This approach might help to prevent immutable words of a known document from being confused with mutable words of a received document - for example, the last name of the borrower identified on a mortgage transaction form might be "Borrower", a mutable
word that might be matched to an immutable word "Borrower" on a "blank"
form. Such is less likely if the form, instead of being "blank", includes
gibberish words that can be picked up by the matching algorithm.) If a "blank" form includes them, placeholder words would appear at the locations of
mutable text in a wordlist made from the "blank" form, and would not match
to a TEST wordlist extracted from a received document.
Thus, in some circumstances, it may be advantageous to provide an algorithm capable of identifying not only additions to forms, but also deletions.
One such process is the common subsequence matching
process illustrated in Fig. 6. This process is used in a manner similar to the
process of Fig. 5, but performs a more sophisticated analysis of a TEST and
KNOWN wordlist to uniquely identify the longest subsequence of common
words in the two wordlists. This is accomplished, as in Fig. 5, by first extracting the TEST wordlist from the received, unknown document (step 250), and then selecting a KNOWN wordlist from the available pool (step 252).
The process of Fig. 6 uses two tables to store located
subsequence information; each table has a number of rows equal to the number of words in TEST (which is stored as "m" in step 254), and a number of columns equal to the number of words in KNOWN (which is stored as "n" in
step 254). The first table, known as bTABLE (which is dimensioned in step
256), provides data indicating the words in the longest common subsequence,
and the second table, known as cTABLE (also dimensioned in step 256), tracks the length of the longest subsequence that has been identified at a
particular time during execution of the algorithm. The content of cTABLE
and bTABLE, taken together, thus fully characterize the location and length of
the subsequences found in TEST and KNOWN.
In step 258, the cTABLE is initialized by setting the values in its 0th row and 0th column to 0 (representing that no common subsequences are
known at the start of processing), and in step 260, the loop variables i and j are
initialized to values of 1, so that processing of KNOWN and TEST begins at the first word of each.
Processing in Fig. 6 proceeds through a double loop of steps
262 through 276, in which each word in TEST is compared sequentially to the word in the corresponding and later positions in KNOWN. By thus
proceeding systematically through a comparison of KNOWN and TEST, any common subsequence of KNOWN and TEST is reliably identified and captured by the information in cTABLE and bTABLE.
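The table-filling double loop may be sketched as a conventional longest-common-subsequence computation. In this sketch, the function names lcs_tables and common_words are hypothetical, the three bTABLE markers are written as the strings "\\", "^" and "<", and the trace-back used to read the subsequence out of bTABLE is the conventional one for this kind of table, assumed here rather than taken from the figures.

```python
def lcs_tables(test, known):
    """Fill cTABLE (lengths) and bTABLE (markers) for TEST and KNOWN."""
    m, n = len(test), len(known)
    # Row 0 and column 0 stay 0: no common subsequence at the start.
    c = [[0] * (n + 1) for _ in range(m + 1)]
    b = [[""] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if test[i - 1] == known[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
                b[i][j] = "\\"  # words i and j match
            elif c[i - 1][j] >= c[i][j - 1]:
                c[i][j] = c[i - 1][j]
                b[i][j] = "^"   # continue from word i-1 of TEST
            else:
                c[i][j] = c[i][j - 1]
                b[i][j] = "<"   # continue from word j-1 of KNOWN
    return c, b


def common_words(test, known):
    """Read the longest common subsequence back out of bTABLE."""
    _, b = lcs_tables(test, known)
    i, j = len(test), len(known)
    out = []
    while i > 0 and j > 0:
        if b[i][j] == "\\":
            out.append(test[i - 1])
            i -= 1
            j -= 1
        elif b[i][j] == "^":
            i -= 1
        else:
            j -= 1
    return out[::-1]


TEST = "Name: Tim Ext.: x123 Title: Programmer Spvsr.: Phil".split()
KNOWN_2 = "Name: Ext.: Fax: Home: Title: Spvsr.:".split()
matched = common_words(TEST, KNOWN_2)
```

Applied to the earlier example, this sketch matches four words of KNOWN #2 despite the absence of "Fax:" and "Home:" from the TEST wordlist.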
In a first step 262, it is determined whether the current word in
TEST (at position i) is the same as the current word in KNOWN (at position j). If so, then any common subsequence that has previously been identified in
the parts of TEST and KNOWN that precede words i and j, respectively, can be extended by one word, to include the matching words i and j. Accordingly,
in step 264, the value in cTABLE at (i-1,j-1), which represents the length of
the longest subsequence found up to words i-1 and j-1 in TEST and KNOWN, is incremented by 1, and the result is stored in cTABLE at entry (i,j). Entry (i,j) in cTABLE thus reflects the length of the longest subsequence found, up to words i and j of TEST and KNOWN. Also in step 264, the bTABLE entry
at (i,j) is updated to store the value "\", a symbol that represents that words (i,j)
of TEST and KNOWN are matching words (the manner in which the symbols
of bTABLE interplay to identify subsequences will be appreciated from the
examples provided below).
In the event that word i of TEST and word j of KNOWN do not match in step 262, processing continues to step 266, in which a test is
performed to determine the longest previously-identified common subsequence of TEST and KNOWN. Specifically, in step 266, the value of
cTABLE at (i-1,j) is compared to the value of cTABLE at (i,j-1). This test determines whether a longer subsequence was found in the immediately
preceding comparison of word i-1 of TEST to word j of KNOWN, or in the preceding (and earlier) comparison of word i of TEST to word j-1 of
KNOWN. In the event that the cTABLE entry at (i-1,j) is greater than or equal to the cTABLE entry at (i,j-1), this indicates that the longest subsequence
was found in the preceding comparison of word i-1 of TEST to word j of
KNOWN, and in this case in step 268 the value of cTABLE at entry (i-1,j) is stored in cTABLE at the entry (i,j), thus reflecting that the longest
subsequence has the same length after word i of TEST as before word i of TEST (because word i of TEST did not match word j of KNOWN). Also in
step 268, the symbol "^" is placed in bTABLE at entry (i,j), reflecting that the longest known subsequence at word i of TEST and word j of KNOWN
continues from word i-1 of TEST and word j of KNOWN.
If the comparison of step 266 is false, this indicates that the
longest subsequence was found in the preceding comparison of word i of
TEST to word j-1 of KNOWN, and in this case in step 270 the value of
cTABLE at entry (i,j-1) is stored in cTABLE at entry (i,j), thus reflecting that the
longest subsequence has the same length after word j of KNOWN as before word j of KNOWN (because word i of TEST did not match word j of KNOWN). Also in step 270, the symbol "<" is placed in bTABLE at entry (i,j), reflecting that the longest known subsequence at word i of TEST and
word j of KNOWN continues from word i of TEST and word j-1 of KNOWN.
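The three outcomes of steps 262 through 270 can be summarized as a single cell update. The following Python sketch is illustrative only: the function name update_cell and the list-of-lists tables are assumptions, positions i and j are 1-based as in the description above, and the symbol placed in step 268 is written here as the ASCII character "^".

```python
def update_cell(c_table, b_table, test_words, known_words, i, j):
    """Fill entry (i, j) of cTABLE and bTABLE; i and j are 1-based positions."""
    if test_words[i - 1] == known_words[j - 1]:        # step 262: words match
        c_table[i][j] = c_table[i - 1][j - 1] + 1      # step 264: extend by one word
        b_table[i][j] = "\\"
    elif c_table[i - 1][j] >= c_table[i][j - 1]:       # step 266: compare neighbours
        c_table[i][j] = c_table[i - 1][j]              # step 268: continue from word i-1 of TEST
        b_table[i][j] = "^"
    else:
        c_table[i][j] = c_table[i][j - 1]              # step 270: continue from word j-1 of KNOWN
        b_table[i][j] = "<"


# One-word TEST and KNOWN lists that match at position (1, 1).
c_table = [[0, 0], [0, 0]]
b_table = [[None, None], [None, None]]
update_cell(c_table, b_table, ["Name:"], ["Name:"], 1, 1)
```

Because the comparison of step 266 uses "greater than or equal to", a tie between the two neighbouring entries is resolved in favour of the entry at (i-1,j).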
After step 264, 268 or 270, in step 272 it is determined whether the end of the TEST wordlist has been reached. If not, then the pointer i is incremented in step 274, to begin comparison of the next word of TEST with the current word of KNOWN in the above-described manner, and processing
returns to step 262.
When the end of the TEST wordlist is reached, processing continues to step 276, at which it is determined whether the end of the KNOWN wordlist has been reached. If not, then the pointer j is incremented in step 278, and the pointer i is reset to 1, to begin comparison of the first word
of TEST with the next word of KNOWN in the above-described manner, and processing returns to step 262.
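The traversal order of steps 272 through 278, in which the pointer i runs through all of TEST for each position j of KNOWN, might be sketched as follows. This is a hedged illustration rather than the implementation of Fig. 6; the name fill_tables is hypothetical, and "^" again stands for the symbol placed in step 268.

```python
def fill_tables(test_words, known_words):
    """Fill cTABLE and bTABLE, visiting every word of TEST for each word of KNOWN."""
    m, n = len(test_words), len(known_words)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    b = [[None] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):          # outer loop over KNOWN (steps 276 and 278)
        for i in range(1, m + 1):      # inner loop over TEST (steps 272 and 274)
            if test_words[i - 1] == known_words[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
                b[i][j] = "\\"
            elif c[i - 1][j] >= c[i][j - 1]:
                c[i][j] = c[i - 1][j]
                b[i][j] = "^"
            else:
                c[i][j] = c[i][j - 1]
                b[i][j] = "<"
    return c, b


c, b = fill_tables("a b c".split(), "a c".split())
```

This order is sufficient because entry (i,j) depends only on entries in column j-1, which is already complete, and on entry (i-1,j), which was filled earlier in the same pass through TEST.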
When the end of the KNOWN wordlist is reached, the double loop of steps described above is complete, and processing continues from step
276 to step 280. In step 280 the cTABLE and bTABLE are evaluated to determine the quality of the match between TEST and the currently selected
KNOWN wordlist. Specifically, the largest value in the cTABLE represents
the length of the longest matched subsequence of TEST and KNOWN, and is a good indicator of the quality of match between TEST and the current KNOWN. If in step 280 the largest value in the cTABLE is larger than the
current best match, then in step 282 the cTABLE and bTABLE are stored as the current best match. After these steps, in step 284 it is determined whether there are further known wordlists to compare to the TEST wordlist, and if so,
processing returns to step 252 to select a remaining known wordlist for
comparison. After all candidate known wordlists have been evaluated,
processing continues from step 284 to step 286, in which the best match
generated during the previous efforts is evaluated and compared to criteria that establish when a "match" is considered accomplished. These criteria may, for
example, require that the matched wordlist include a limited number of unmatched / apparently mutable words, or include matching for all or substantially all of the words in the known wordlist. If these criteria indicate a
success, then processing continues to step 288, and the mutable text is
extracted from the best match file, and its meaning recognized based upon its
proximity to immutable text, in a manner analogous to the processing
discussed above with reference to Fig. 4, steps 162-170.
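The selection of a best match in steps 280 through 286 might be sketched as follows. The sketch is offered under stated assumptions: the names lcs_tables and best_match, the dictionary used to hold the current best match, and the parameter min_known_fraction with its default of 0.5 are all hypothetical, and the acceptance test shown is only one possible reading of the criterion that "all or substantially all" of the known words be matched.

```python
def lcs_tables(test_words, known_words):
    """Build cTABLE and bTABLE for one TEST/KNOWN pair (steps 254 through 278)."""
    m, n = len(test_words), len(known_words)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    b = [[None] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            if test_words[i - 1] == known_words[j - 1]:
                c[i][j], b[i][j] = c[i - 1][j - 1] + 1, "\\"
            elif c[i - 1][j] >= c[i][j - 1]:
                c[i][j], b[i][j] = c[i - 1][j], "^"
            else:
                c[i][j], b[i][j] = c[i][j - 1], "<"
    return c, b


def best_match(test_words, known_lists, min_known_fraction=0.5):
    """Return the best-scoring known wordlist, or None on a matching failure."""
    best = None
    for name, known_words in known_lists.items():       # steps 252 and 284
        c, b = lcs_tables(test_words, known_words)
        score = max(max(row) for row in c)               # step 280: largest cTABLE value
        if best is None or score > best["score"]:
            best = {"name": name, "known": known_words,  # step 282: keep as current best
                    "score": score, "c_table": c, "b_table": b}
    if best is None:
        return None                                      # no known wordlists were supplied
    # Step 286 (assumed criterion): enough of the known words must be matched.
    if best["score"] >= min_known_fraction * len(best["known"]):
        return best
    return None                                          # step 290: matching failure


test_words = "Name: Tim Ext.: x123 Title: Programmer".split()
known_lists = {
    "form A": "Name: Fax:".split(),
    "form B": "Name: Ext.: Title:".split(),
}
result = best_match(test_words, known_lists)
```

A caller that receives None could end the matching process or retry with a longer TEST wordlist; that choice is left outside this sketch.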
In the event that the best matching known wordlist does not meet the criteria of step 286, a matching failure is returned in step 290. This may end the matching process, or as discussed above may cause the matching
process to restart, using a TEST wordlist created from a series of the pages of a received stream beyond those used in the initial evaluation.
An example demonstrating the results of the process of Fig. 6 follows. In this example, the wordlist to be tested includes the following:
TEST = Name: Tim Ext.: x123 Title: Programmer Spvsr.: Phil
Three known wordlists are matched to this test word list, as follows:
KNOWN #1 = Name: Ext.: Fax: Home: Title:
KNOWN #2 = Name: Ext.: Fax: Home: Title: Spvsr.:
KNOWN #3 = Name: Ext.: Title:
Matching proceeds according to Fig. 6, resulting in the following cTABLE and
bTABLE results, indicating the identified subsequences for each known word list
(underlining identifies words matched from the known word list):
KNOWN #1 = Name: Ext.: Fax: Home: Title:
TEST = Name: Tim Ext.: x123 Title: Programmer Spvsr.: Phil
Result #1 = Name: Tim Ext.: x123 Title: Programmer Spvsr.: Phil Matches = 3
KNOWN #2 = Name: Ext.: Fax: Home: Title: Spvsr.:
TEST = Name: Tim Ext.: x123 Title: Programmer Spvsr.: Phil
Result #2 = Name: Tim Ext.: x123 Title: Programmer Spvsr.: Phil Matches = 4
KNOWN #3 = Name: Ext.: Title:
TEST = Name: Tim Ext.: x123 Title: Programmer Spvsr.: Phil
Result #3 = Name: Tim Ext.: x123 Title: Programmer Spvsr.: Phil Matches = 3
It may be observed from this example that the matching process
of Fig. 6 successfully identifies more matches in known wordlists than the process of Fig. 5. Specifically, the matching process of Fig. 6 not only
determines that the immutable words "Name:", "Ext." and "Title" of the third word list are matched to the test word list, but furthermore, the matching
process of Fig. 6 determines that KNOWN #2 is the best match of the three known word lists to the TEST word list by matching not only "Name:", "Ext." and "Title", but also matching "Spvsr.". The process of Fig. 6 thus matches "Spvsr." in the TEST word list to KNOWN #2, even though KNOWN #2
includes the word "Fax:" which is not included in the TEST wordlist. The omission of the immutable word "Fax:" from the TEST wordlist, or other deletions, does not prevent matching of the test wordlist to words beyond the deleted words in the known wordlist, and thus permits the best matching
known wordlist to be identified.
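The example above can be reproduced with a short Python sketch that fills the two tables and then walks bTABLE backward from its last entry to recover the matched words. Several points are assumptions made for illustration: the function names, the ASCII symbol "^" for the step 268 entry, the spelling "Spvsr.:" for the last word of KNOWN #2 (so that it compares equal to the word in TEST), and the rule in label_mutable_words that attaches each unmatched word to the nearest preceding matched word. That last rule is only a stand-in for the proximity-based processing of Fig. 4, which is not reproduced here.

```python
def lcs_tables(test_words, known_words):
    """Build cTABLE and bTABLE for one TEST/KNOWN pair."""
    m, n = len(test_words), len(known_words)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    b = [[None] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            if test_words[i - 1] == known_words[j - 1]:
                c[i][j], b[i][j] = c[i - 1][j - 1] + 1, "\\"
            elif c[i - 1][j] >= c[i][j - 1]:
                c[i][j], b[i][j] = c[i - 1][j], "^"
            else:
                c[i][j], b[i][j] = c[i][j - 1], "<"
    return c, b


def matched_positions(b, m, n):
    """Follow the bTABLE symbols back from (m, n); return 0-based TEST indices."""
    i, j, hits = m, n, []
    while i > 0 and j > 0:
        if b[i][j] == "\\":      # matching words: record and step diagonally
            hits.append(i - 1)
            i, j = i - 1, j - 1
        elif b[i][j] == "^":     # subsequence continues from word i-1 of TEST
            i -= 1
        else:                    # subsequence continues from word j-1 of KNOWN
            j -= 1
    return hits[::-1]


def label_mutable_words(test_words, hits):
    """Attach each unmatched word to the nearest preceding matched word (assumed rule)."""
    fields, label, hit_set = {}, None, set(hits)
    for idx, word in enumerate(test_words):
        if idx in hit_set:
            label = word
            fields.setdefault(label, [])
        elif label is not None:
            fields[label].append(word)
    return fields


TEST = "Name: Tim Ext.: x123 Title: Programmer Spvsr.: Phil".split()
KNOWN = {
    "KNOWN #1": "Name: Ext.: Fax: Home: Title:".split(),
    "KNOWN #2": "Name: Ext.: Fax: Home: Title: Spvsr.:".split(),
    "KNOWN #3": "Name: Ext.: Title:".split(),
}

matches = {}
for name, known in KNOWN.items():
    c, b = lcs_tables(TEST, known)
    hits = matched_positions(b, len(TEST), len(known))
    matches[name] = [TEST[k] for k in hits]
    print(name, "Matches =", c[len(TEST)][len(known)], matches[name])

_, b2 = lcs_tables(TEST, KNOWN["KNOWN #2"])
fields = label_mutable_words(
    TEST, matched_positions(b2, len(TEST), len(KNOWN["KNOWN #2"]))
)
```

Under these assumptions the sketch reports 3, 4 and 3 matches for the three known wordlists, in line with the discussion above, and the unmatched words "Tim", "x123", "Programmer" and "Phil" are grouped under "Name:", "Ext.:", "Title:" and "Spvsr.:" respectively when KNOWN #2 is used.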
While the present invention has been illustrated by a
description of various embodiments and while these embodiments have been described in considerable detail, it is not the intention of the applicants to restrict or in any way limit the scope of the appended claims to such detail.
Additional advantages and modifications will readily appear to those skilled in the art. The invention in its broader aspects is therefore not limited to the
specific details, representative apparatus and method, and illustrative example shown and described. Accordingly, departures may be made from such details
without departing from the spirit or scope of applicant's general inventive concept.
What is claimed is:
Claims
1. A method of recognizing the content of a document as part of
electronic delivery thereof, comprising generating a word list for said document, recognizing immutable content of said document that corresponds to a form used in generating the document,
identifying meanings for mutable content included in said word list based upon positions of words in said word list relative to recognized immutable content, storing said mutable content in association with identified
meanings thereof for subsequent retrieval.
2. The method of claim 1 wherein said immutable content comprises words in said word list.
3. The method of claim 1 wherein said immutable content
comprises graphical content in said document.
4. The method of claim 1 wherein said document is graphically described.
5. The method of claim 4 wherein generating said word list comprises one or more of: matching a character map embedded within said
document to graphical content of said document and performing character recognition upon said graphical content.
6. The method of claim 1 wherein said immutable content is identified by comparison of said word list to a word list of a known document.
7. The method of claim 6 wherein said document is recognized as the product of a form upon recognition of immutable content therein as
matching a known document descriptive of the form.
8. The method of claim 7 wherein the known document
descriptive of the form is generated by identification of common word
subsequences in a plurality of documents generated from the form.
9. The method of claim 6 wherein said immutable content is
identified by identification of common word subsequences in said word list
and a word list of a known document.
10. The method of claim 6 wherein said immutable content is identified by comparison of said word list to word lists of a plurality of known documents.
11. The method of claim 1 wherein said mutable content is
stored in one or more of an XML or MISMO SmartDoc format.
12. The method of claim 1 wherein said immutable content is
identified by recognizing graphical boxes included in said document.
13. The method of claim 12 wherein said graphical boxes are
recognized by dividing said document into rectangles consistently with the positions of graphic lines in said document.
14. The method of claim 13 wherein prior to recognition of said
graphical boxes, intersections between horizontal and vertical lines in said document are corrected to create T intersections between horizontal and vertical lines when a horizontal or vertical line ends near to but not at a
vertical or horizontal line, respectively.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US25625605A | 2005-10-21 | 2005-10-21 | |
US11/256,256 | 2005-10-21 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2007050372A2 (en) | 2007-05-03 |
WO2007050372A3 (en) | 2007-12-06 |
Family
ID=37968371
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2006/040619 WO2007050372A2 (en) | 2005-10-21 | 2006-10-18 | Document recognition method |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2007050372A2 (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1679625A2 (en) * | 2005-01-10 | 2006-07-12 | Xerox Corporation | Method and apparatus for structuring documents based on layout, content and collection |
Non-Patent Citations (2)
Title |
---|
CHUNG C Y ET AL INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS: "Reverse engineering for Web data: from visual to semantic structures" PROCEEDINGS 18TH. INTERNATIONAL CONFERENCE ON DATA ENGINEERING. (ICDE'2002). SAN JOSE, CA, FEB. 26 - MARCH 1, 2002, INTERNATIONAL CONFERENCE ON DATA ENGINEERING. (ICDE), LOS ALAMITOS, CA : IEEE COMP. SOC, US, vol. CONF. 18, 26 February 2002 (2002-02-26), pages 53-63, XP010588199 ISBN: 0-7695-1531-2 * |
HAN W ET AL: "WRAPPING WEB DATA INTO XML" SIGMOD RECORD, SIGMOD, NEW YORK, NY, US, vol. 30, no. 3, September 2001 (2001-09), pages 33-38, XP009016073 ISSN: 0163-5808 * |
Also Published As
Publication number | Publication date |
---|---|
WO2007050372A3 (en) | 2007-12-06 |
Legal Events
Code | Title | Description |
---|---|---|
NENP | Non-entry into the national phase | Ref country code: DE |
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC |
122 | Ep: pct application non-entry in european phase | Ref document number: 06826142; Country of ref document: EP; Kind code of ref document: A2 |