WO2007050372A2

WO2007050372A2 - Document recognition method

Info

Publication number: WO2007050372A2
Application number: PCT/US2006/040619
Authority: WO
Inventors: Timothy John Boemker
Original assignee: Elynx, Ltd.
Priority date: 2005-10-21
Filing date: 2006-10-18
Publication date: 2007-05-03
Also published as: WO2007050372A3

Abstract

The content of a document is recognized as part of electronic delivery thereof, by generating (100) a word list for the document, recognizing (106) immutable content of the document that corresponds to a form used in generating the document, identifying (112) meanings for mutable content of the document based upon positions of mutable content relative to immutable content, and then storing (114) the mutable content in association with the identified meanings thereof for subsequent retrieval.

Description

ATTORNEY DOCKET NO: ELYN/11

DOCUMENT RECOGNITION METHOD

Field of the Invention

The present invention relates to delivery of documents and data from and between client computers and a server and a computer network.

Background of the Invention

Modern technology has developed a number of methods for

delivering documents between a sender and a recipient that are alternative to

the traditional physical delivery of paper copies. One older example is the use of facsimile technology. Facsimiles are widely used at the present time for distribution of simple documents, but facsimile transmission has numerous

drawbacks. First, the quality of the printed document at the recipient is low,

and clearly a reduction from the original. This may result in a loss of content or at least readability. Furthermore, facsimile documents are stored in a raster

graphic form and cannot be easily edited. For example, unless a complex conversion is performed, the recipient cannot change the text with a text

editor, or move the graphics on the document, and the like. Furthermore, facsimile transmissions take a relatively long time to complete, particularly for

long elaborate and complex documents. Facsimile transmissions may be

stored electronically so that they may be preserved, printed multiple times, or forwarded electronically, but such uses of facsimile transmission data do not

overcome the difficulties of quality, presentation and time of transmission that are typical of facsimile methods.

Another widely used technology for distributing documents is

electronic mail. Electronic mail permits a document to be transmitted

electronically from one computer to another, and offers the advantages of convenience, electronic storage, and delivery of a document in its native electronic

format. As a consequence, documents delivered via electronic mail may be printed at the same quality as they are transmitted, and may be edited by the recipient using the same software application in which they were generated by the sender.

Although electronic mail is thus preferable in many ways to

facsimile transmission, electronic mail has drawbacks of its own. Specifically,

electronic mail is not secure, and it is not ideal for transmitting very long documents which may exceed an email server's size limitations. Furthermore,

the need for compatible software at the recipient of an electronic mail can be a formidable challenge to the recipient's use of the document. In some cases

documents formatted for one application may be converted to another format for use by another application for editing. There are also document formats available which are relatively platform independent, such as the portable

document format PDF promulgated by Adobe Systems, which is based upon

the PostScript page description language and defines the content of a

document graphically. However, a recipient wishing to extract data from a received document that is in an unusual format may be unable to extract information therefrom.

In many business contexts, documents are generated by legacy

computer systems, in a format that is older than and incompatible with modern word processing systems. Files formatted for legacy computer systems of this type typically cannot be utilized by more modern computer systems and thus are not able to be usefully transmitted by electronic mail.

To manage these incompatibilities, software has been developed for retrofitting legacy computer systems for electronic document

delivery. Specifically, software or hardware is provided that emulates a printer

receiving a document from the legacy computer system. This software or hardware captures a Printer Control Language file generated by the legacy

computer system, which is then electronically transmitted from the legacy

computer system to recipient computer systems. The Printer Control Language file may then be sent to a printer at the recipient computer system with the same quality as can be achieved at the sender.

While this approach avoids the incompatibility problems of

legacy computer systems, it still suffers from various inherent problems.

Specifically, Printer Control Language files are not word processing formats,

and typically represent content of a document, including text, in a graphical form. As a consequence, Printer Control Language files cannot be readily edited by the recipient. Furthermore, the recipient cannot readily extract data

from a Printer Control Language file, as compared to files formatted in modern

word processing formats. Thus, the recipient computer system cannot readily identify the nature of the document represented by a Printer Control Language

file, determine whether it is a form, and what particular type of form is being used, or extract data from that form, or distinguish the text that represents the

content of the form from graphical content or the text that defines the form.

This means that a recipient computer system has only limited use for a Printer

Control Language file delivered in accordance with the methods described above. Specifically, the file can be printed, and can be visually displayed. If the recipient desires to categorize the received file or extract data from it, the

recipient typically needs to manually review the content of the file in a printout

or on the screen to identify its content and manually extract data.

As a result of these difficulties with existing Printer Control Language based document sharing, it has been known to modify legacy

computer systems to produce cover pages or other simplified content

preceding a printed document. The purpose is to include identifying

information and readily identifiable data relating to the document, in a

simplified cover page rather than within the form itself. The cover page is then scanned by a recipient computer system to identify the form and data

relating to the document. The cover page is then removed and the remainder of the form is printed or displayed. Typically, the cover page is presented in a

very simple format that facilitates scanning of its corresponding Printer Control Language expression and extraction of content. Unfortunately, this approach suffers from the drawback that the legacy computer system must be modified so that each form contains a cover page. Such modifications are

cumbersome and may be difficult to achieve, particularly if there is a need to maintain compatibility with conventional uses of the legacy computer system.

Furthermore, customized software must be developed to capture data and form identification from cover pages, which may need to be unique to each computer system at issue.

Therefore, there is a need to process data in a primarily graphical electronic format to identify form content in that format, and extract data from a form stored in that format, without requiring modification of the form or inclusion of extraneous data in the form. There is further a need for

automatic identification of text and the meaning thereof within primarily

graphically described forms. Finally, there is a need to manage multiple printed forms which may be generated by a sending computer system, to identify

those forms and determine which of several potential forms is being presented

by a graphically defined electronic file.

Summary of the Invention

These needs are met by the invention, which provides a method of recognizing the content of a document as part of electronic delivery thereof,

by generating a word list for the document, recognizing immutable content of the document that corresponds to a form used in generating the document, identifying meanings for mutable content of the document based upon positions of mutable content relative to immutable content, and then storing

the mutable content in association with the identified meanings thereof for

subsequent retrieval.

In one specific embodiment described below, the immutable content is identified by recognizing graphical boxes included in the document.

More specifically, graphical boxes are recognized by dividing said document into rectangles consistently with the positions of graphic lines in the document.

For efficiency, prior to recognition of graphical boxes, intersections between

horizontal and vertical lines in said document are corrected to create T intersections - to correct cases where a horizontal or vertical line ends near to

but not at a vertical or horizontal line, respectively.

In the specific embodiment described below, the documents at issue are home mortgage transaction documents generated from standard

banking and mortgage forms, and the immutable content of the forms used

includes not only graphical elements, but also words, such as the title of a form, box content identifiers, legal form paragraphs, and the like. The document is processed to extract from this immutable content, mutable content

such as the borrower's and lender's name and address, and financial terms for a transaction, which are stored in one or more of an XML or MISMO SmartDoc

format for sharing, as data files, between clients of a document delivery service. However, numerous other applications are possible.

In this embodiment, a document is graphically described when

received, i.e., it is in the form of a PCL, PostScript, PDF or raster image.

Accordingly, generating a word list for the document involves matching a character map embedded within the document to graphical content of the

document, performing character recognition upon the graphical content, or other forms of word recognition.

In an alternative embodiment described herein, the immutable

content is in the form of words alone, and is identified by comparison of the word list for a document, to word lists of multiple known documents. In this embodiment, when the document word list is matched to the word list of a

known document, the document can be recognized as a relative of the known

document, typically both being the product of a common legal form. At that

point, the mutable content of the document may be identified based upon the

immutable content and the specifics of the recognized form. Known

documents descriptive of form may be generated by capturing the word list for a "blank" of each form in use. More robustly, known document word lists may

be generated by a process that identifies common word subsequences in a plurality of documents using the same form. The common subsequences can

then be used as a wordlist for recognizing other documents created with the same form.
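
The derivation of a form word list from several filled-out documents can be illustrated with a short sketch. It is only an approximation under stated assumptions: the function names `common_subsequence` and `derive_template` are hypothetical, words are compared by exact equality, and folding pairwise longest common subsequences across the documents is a simplification that is not guaranteed to find the longest subsequence common to all of them.

```python
from functools import reduce
from typing import List


def common_subsequence(a: List[str], b: List[str]) -> List[str]:
    """Return one longest common subsequence of two word lists."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of a[i:] and b[j:].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the start to recover the shared words in order.
    out: List[str] = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def derive_template(documents: List[List[str]]) -> List[str]:
    """Fold the pairwise common subsequence across all sample documents."""
    return reduce(common_subsequence, documents)


# Illustrative sample documents (invented text, not taken from any real form).
documents = [
    "NOTE Borrower Jane Doe promises to pay Lender First Bank".split(),
    "NOTE Borrower John Q Public promises to pay Lender Acme Mortgage".split(),
    "NOTE Borrower Ann Lee promises to pay Lender First Bank".split(),
]
template = derive_template(documents)
```

In this example the names that vary between documents drop out, and the words shared by every sample remain as the candidate immutable word list.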

A similar process for identifying common word subsequences

may also be used to compare a document to possible forms; the form that has the largest matching common word subsequence can then be determined to be

the form used in creating the document. This process permits recognition of a

document as the product of a form even where there are insertions and deletions between the document and form.
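
Selecting the form with the largest matching common word subsequence might be sketched as follows. This is a minimal illustration rather than the method of the specification: `lcs_length` and `best_matching_form` are hypothetical names, the form word lists are invented, and the raw subsequence length is used as the score without any normalization for form length.

```python
from typing import Dict, List, Tuple


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two word lists."""
    prev = [0] * (len(b) + 1)
    for word_a in a:
        curr = [0]
        for j, word_b in enumerate(b, start=1):
            if word_a == word_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def best_matching_form(document: List[str], forms: Dict[str, List[str]]) -> Tuple[str, int]:
    """Return (form name, score) for the form sharing the longest subsequence."""
    scores = {name: lcs_length(document, words) for name, words in forms.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Invented form word lists, for illustration only.
forms = {
    "promissory_note": "PROMISSORY NOTE Borrower promises to pay Lender the principal sum of".split(),
    "disclosure": "TRUTH IN LENDING DISCLOSURE Borrower Lender Annual Percentage Rate".split(),
}
document = (
    "PROMISSORY NOTE Borrower Jane Doe promises to pay Lender "
    "First Bank the principal sum of 175,000"
).split()
name, score = best_matching_form(document, forms)
```

Because a subsequence match tolerates gaps, the inserted names and amount in the document do not prevent all of the promissory note's words from being matched in order.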

The above and other objects and advantages of the present

invention shall be made apparent from the accompanying drawings and the description thereof.

Brief Description of the Drawing

Fig. 1 is a diagram of a network of computers including a

plurality of client computers and a server computer utilized in accordance with

principles of the present invention;

Fig. 2A is an illustration of a HUD-1 settlement statement form;

Fig. 2B is an illustration of a sequence of form documents generated by a legacy computer system on sequential pages, to be recognized and separated into individual files in accordance with principles of the present invention;

Fig. 3 is a process for identifying mutable and immutable text on a primarily graphically defined document and generating a data file representing the data on the form;

Fig. 4 is a process for identifying rectangles identifying sections

of a primarily graphically defined form and extracting data from rectangles on the form;

Fig. 4A is an illustration of horizontal correction of line intersections in the process of Fig. 4;

Fig. 4B is an illustration of vertical correction of line intersections in the process of Fig. 4;

Fig. 4C is an illustration of exemplary documents containing rectangles and the rectangles recognized therefrom according to the process of Fig. 4;

Fig. 5 is an illustration of a page matching process for

identifying a form from text thereon and extracting data from the form;

Fig. 6 is an illustration of an enhanced page matching process for identifying a form from text thereon and extracting data from the form.

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and, together with a general description of the invention given above, and the

detailed description of the embodiments given below, serve to explain the principles of the invention.

Detailed Description of Specific Embodiments

Referring now to Fig. 1, a network of computers for carrying out the principles of the present invention can be described. At the heart of this

computer network is a server 10 which operates in conjunction with a mass

storage facility 11. As will be discussed in greater detail below, mass storage 11 includes received documents (which are typically filled-out forms and will be identified from time to time in the following as "forms"), template

documents, and data extracted from the forms using the templates. Server 10 interacts with a plurality of remotely located client computers via a network,

which in the illustrated embodiment is the Internet.

Principles of the present invention are applicable to a wide variety of potential applications involving the creation of documents and

extraction of data therefrom. For the purposes of illustration, the invention is

disclosed herein in the context of a particular application, which is the

transmission of mortgage and real estate purchase and refinancing forms between a lender, purchaser, broker, attorney and other parties involved in a real estate transaction. Accordingly, the clients illustrated in Fig. 1 are those typically involved with the real estate purchase transaction, although

other applications and other types of client computers may be utilized consistent with principles of the present invention.

In this illustrated embodiment, real estate purchase/refinance and mortgage forms are initially generated by a lender bank's home office, at a

client computer 12. In a typical embodiment, the lender bank home office computer 12 may be a legacy computer system, such as a mainframe computer

system, which generates computerized forms in an unusual electronic format.

Legacy computer systems are often utilized for such forms for the reason that such forms must be compliant with standards established by the mortgage

lender and state or local regulation of mortgage and real estate purchase

transactions. For this reason, typically a legacy computer system has been extensively customized and revised to generate mortgage and real estate transaction forms and is not readily replaced with more modern computer systems.

It will be appreciated that real estate specific forms may also be

generated by other entities involved in the real estate transaction, such as by the lender's or buyer's attorney, the buyer's or seller's real estate broker or by

the seller or buyer. Furthermore, it will be appreciated that although such forms may be generated by legacy systems, they may also be generated by modern computer systems such as those using Microsoft's Windows operating system and word processing software. The present invention is adaptable to

forms generated by any or all of the above identified sources and types.

Forms generated by client computers, such as the bank home office computer 12, may take a variety of forms. Forms managed by the system may be highly graphical forms 14, including graphical features such as

boxes and potentially icons or illustrations, providing instruction on the use of

the form and providing a more highly formatted appearance when the form is completed. An example of such a form is the HUD-1 form promulgated by the United States Department of Housing and Urban Development, which is used

as a standard closing summary form in real estate transactions in the United

States.

As seen in Fig. 2A, the HUD-1 form includes graphical boxes and highly formatted arrangements of information to present the purchaser, seller and lender details, and financing details on a real estate purchase

transaction. Some of these boxes, such as box 30 in the upper left of the form,

include only identifying information for the form, and do not include content

representing the data on the form. In other cases, such as box 32, the form

presents a series of check boxes and the data in the form is in the form of a

mark in one of those boxes. In further cases, such as box 34, a box includes an identifying label such as "D. NAME AND ADDRESS OF BORROWER" and textual content that corresponds to that label.
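
A labeled box of this kind can be read by treating the words that follow a known label as the content for that label. The sketch below is an illustrative assumption, not the procedure of the specification: `split_by_labels` is a hypothetical helper, the names and addresses are invented, the words are assumed to arrive in reading order, and each label is assumed to be a non-empty word sequence.

```python
from typing import Dict, List, Optional


def split_by_labels(words: List[str], labels: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Assign each run of non-label words to the field of the label it follows."""
    fields: Dict[str, List[str]] = {}
    current: Optional[str] = None
    i = 0
    while i < len(words):
        for name, label in labels.items():
            # A label (immutable text) starts here: switch the active field.
            if words[i:i + len(label)] == label:
                current = name
                fields.setdefault(name, [])
                i += len(label)
                break
        else:
            # Not a label: treat the word as mutable content of the active field.
            if current is not None:
                fields[current].append(words[i])
            i += 1
    return fields


labels = {
    "borrower": "D. NAME AND ADDRESS OF BORROWER".split(),
    "seller": "E. NAME AND ADDRESS OF SELLER".split(),
}
words = (
    "D. NAME AND ADDRESS OF BORROWER Jane Doe 12 Elm St "
    "E. NAME AND ADDRESS OF SELLER John Roe 9 Oak Ave"
).split()
fields = split_by_labels(words, labels)
```

Reading order alone is a simplification; on a boxed form the enclosing rectangle would normally bound which words belong to a label.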

The position of data for the form is typically consistent with the labels, but there may be special cases. For example, there is no box that is explicitly labeled to include the seller's Social Security number, but this

information is often positioned in box 36 which bears the label "G. PROPERTY LOCATION". However, as represented by arrow 38, in some cases this information is positioned in box 38 which bears the label "E. NAME AND ADDRESS OF SELLER". Similarly, although box 40 is labeled "I.

SETTLEMENT DATE", this box may list both the settlement date and the

date on which funds are disbursed.

In some cases on the HUD-1 form, data is presented in a box that is adjacent to the box that labels that data. For example, the financial

terms of a transaction are presented in an area 42 of the HUD-1 form, and in

this area 42, data appears separate from labels. Thus, the box labeled "101.

Contract Sales Price" does not itself identify the contract sales price, but rather

the contract sales price is identified in a box immediately adjacent and to the right thereof. The same pattern repeats elsewhere in area 42 of the form.
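
The adjacent-box pattern can be sketched as a lookup of the rectangle that abuts a label's right edge on the same row. The `Box` type, the `box_to_right` helper, the coordinates, the tolerance, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Box:
    left: float
    top: float
    right: float
    bottom: float
    text: str


def box_to_right(label: Box, boxes: List[Box], tol: float = 2.0) -> Optional[Box]:
    """Return the box whose left edge abuts the label's right edge on the same row."""
    candidates = [
        b for b in boxes
        if b is not label
        and abs(b.left - label.right) <= tol
        and abs(b.top - label.top) <= tol
        and abs(b.bottom - label.bottom) <= tol
    ]
    # Prefer the closest candidate; None when no box is adjacent.
    return min(candidates, key=lambda b: abs(b.left - label.right), default=None)


# Illustrative layout: label boxes on the left, value boxes on the right.
boxes = [
    Box(0, 0, 200, 20, "101. Contract Sales Price"),
    Box(200, 0, 300, 20, "175,000.00"),
    Box(0, 20, 200, 40, "102. Personal Property"),
    Box(200, 20, 300, 40, ""),
]
label = next(b for b in boxes if b.text.startswith("101."))
value = box_to_right(label, boxes)
```

The tolerance absorbs small misalignments between neighboring rectangles; a suitable value would depend on the resolution of the page description.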

Returning now to Fig. 1, highly graphical forms such as are

represented at 14, are not the only forms generated in typical transactions such

as real estate transactions. For example, forms may include lengthy text, e.g., legal recitations, which must be acknowledged by a buyer's or seller's or

lender's signature, such as represented at 16. Forms of this type are less elaborately formatted than HUD-1 type forms 14, but such forms nevertheless

contain customized content, such as buyer, lender or seller names and addresses, property information, or financial information relating to the transaction. Furthermore, such forms typically include a blank for signature by

the buyer, seller or lender or any combination of these. A third type of form includes even fewer customized fields, such as a disclosure document for providing information regarding a transaction to, typically, the buyer of real estate. Such forms are represented at 18, and may include customized content relating to the transaction, such as the buyer's or seller's name or the purchase price,

but may not require a signature and may not typically include the highly

formalized structure of boxes and other graphical items typical of the HUD-1 form 14 and similar such forms.

Referring now to Fig. 2B, it will be appreciated that multiple

forms of these various types may be generated by a legacy computer system in

a print stream. In one embodiment, a PC desktop program is used to capture the print stream from a legacy computer system and package that print stream for uploading to server 10, as described in U.S. Patent Application 10/702,204

filed November 5, 2003 and assigned to the assignee hereof, which is hereby

incorporated herein by reference in its entirety.

It will be appreciated that the print stream from the legacy computer system may include several sequential pages 46a, 46b, 46c, which may collectively constitute a single form, or may represent multiple forms. In

the case where multiple forms are included in a single captured print stream, in accordance with principles of the present invention, it is desirable to parse

such a stream, recognize each of the forms 14, 16 and 18 in the stream, and extract the data from each form in the stream for subsequent processing. This

may be done by desktop software such as described in the above referenced

patent application, or at server 10, using processes described below.

Forms generated by an originating client computer may be delivered to the server 10 in a variety of potential formats. One example of a format that has traditionally been used to deliver forms for a transaction, is the

Hewlett-Packard developed Printer Control Language or PCL. PCL includes

instructions for controlling a printer, typically a laser jet printer, to place text and graphical elements onto a page. Because Printer Control Language is

understood by a large number of printers from various manufacturers, it has

traditionally been used to transmit documents in electronic form from legacy

computer systems to other computers, in a manner that does not require the use of the legacy format on the receiving computers.

Although a Printer Control Language version of forms 14, 16

and 18 is typical, in accordance with principles of the present invention, other formats may also be utilized for the delivery of documents from a client to a

server and other clients. Specifically, PostScript printer driver output may be transmitted from client computer 12 to server 10, and then utilized to deliver form documents to other client computers. PostScript is a page description language that differs from PCL in a number of manners, and is somewhat less

prevalent than PCL. However, PostScript has wide enough acceptance that PostScript drivers are available for most computer systems and most software packages used with those computer systems. In the event that such drivers are

not available, PCL may be converted to PostScript, using library utilities such as "JETPCL".

PostScript is an advantageous format for presenting graphics as compared to Printer Control Language. Furthermore, a derivative of

PostScript is utilized by the portable document format (PDF) popularized by Adobe Systems and used with the Adobe programs known as Acrobat and

Acrobat Reader. PostScript formatted forms, therefore, are more readily convertible to Adobe Acrobat format for delivery in PDF form, using library

utilities such as "PDFNET".

Beyond these well-known advantages of PostScript, in

accordance with principles of the present invention, it has been recognized that

PostScript may be more readily utilized to extract text from a document,

because PostScript formatted documents include a character encoding table that readily permits parsing of the document to identify the graphics therein

that represent text characters. Although Printer Control Language also includes a character encoding table, character encoding in PCL is typically

more difficult to digest and process because the PCL standard permits omission of encoding tables in the document, thus causing each document to have characters that must be uniquely decoded without the benefit of an

encoding table, to determine which characters are presented at each point in the

document. Because PostScript character encoding may be more easily managed, a PostScript document may be more readily scanned to determine all of the characters utilized in that document.

Specifically, to recognize text in a PostScript formatted

document, the embedded character encoding in the document is identified and

utilized to compare graphical elements in the document to the character encoding, to determine which characters are being presented at each location in the document, after which the words and sentences presented on each page

may be extracted. Such a scanning function is known to those of skill in the

art, and can be accomplished, e.g., by library utilities such as "XPDFTEXT", a

library utility that returns an extracted word from a PDF formatted document each time the utility is invoked.

Thus, in accordance with principles of the present invention,

the text on a form may be extracted and converted to a word list, with each word having a page location, permitting processing of those words to determine words that are customized form data and also permitting those words which are not customized form data to be matched against templates to

determine which form is being utilized.

Returning to Fig. 1, when server 10 receives form content such as a form 14, 16 or 18, server 10 stores those received forms in mass storage

11. Forms may be received in PostScript format, or the reception by server 10 may involve conversion of forms received in PCL format to PostScript. Then,

server 10 processes those forms utilizing a program in server 10 and/or template data in mass storage 11, as described in the various embodiments reviewed below. The outcome of this processing is an identification of the

data contained in the forms - such as the lender's and borrower's names and

addresses, financial details of the transaction, and the like. This data is stored in mass storage device 11 in a retrievable data format such as an extensible

markup language (XML) format, or more particularly the Mortgage Industry

Standards Maintenance Organization (MISMO) SmartDoc format.

After forms have been processed by server 10, those forms are

available for retrieval by other clients and other client computers. For

example, an attorney involved in the transaction working at an attorney computer 22 may desire to download an XML / MISMO SmartDoc version of

a HUD-1 financing statement, to compare the content of the financing statement to the transaction information that the attorney has generated.

Simultaneously, or alternatively, an attorney may wish to download the HUD-1 form 14 itself for printing at the attorney's computer 22. Accordingly the attorney may download a version 14' of the HUD-1 form that has been

converted by server 10, such as a version converted from PCL to PostScript format, and then to portable document format or PDF.

Similarly, a real estate broker utilizing a broker's computer 24

may desire to retrieve XML or MISMO SmartDoc formatted documents such as 20 including financial details of the transaction or name and address

information, as well as reformatted HUD-1 form 14' or other forms uploaded

to server 10. Finally, a buyer utilizing a buyer's computer 26 may receive some or all of the forms generated by the bank home office 12, in this case including not only the HUD-1 form 14 but also a mortgage paper or

promissory note 16 and disclosure document 18, all of which may be

reformatted for use with electronic signature technology as shown at 16', 18'

and 20', and discussed further in U.S. Patent Application Serial No. 11/076,665, filed March 10, 2005, which is hereby incorporated herein by reference.

Referring now to Fig. 3, the overall process for form

recognition in accordance with the principles of the present invention is

illustrated. A first step 100 in this process is to generate a word list for the

document. Generation of a word list may take a number of forms depending upon the format in which the document is made available. For example, a PostScript formatted document may be decoded in step 102 by utilizing the

PostScript character table and the graphics presented therein to recognize

character graphics within the document and convert those character graphics to text letters. Strings of text letters can then be converted into words, and the

resulting words and locations thereof used for subsequent processing.

Alternatively, the document may be a raster graphic image such as a facsimile

file, in which case the file will not contain a character table that may be used to recognize characters from graphics. In this situation, character recognition techniques 104 may be applied to the document to identify characters, and then

words and their location, for subsequent processing. It will be appreciated that

documents in other formats may also be processed in accordance with principles of the present invention. For example, some documents in a PCL

format may include a character table, that may be decoded to identify words and characters therein, although the methodology utilized may differ from the

methodology used for PostScript formatted documents as discussed above.
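
Once characters and their positions have been recovered, by either route, converting them into a word list with locations could look like the sketch below. The `Glyph` and `Word` types, the `glyphs_to_words` helper, and the gap threshold are hypothetical; the decoding of characters from the page description or raster image is assumed to have happened already.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the character
    y: float      # baseline of the character
    width: float


@dataclass
class Word:
    text: str
    page: int
    x: float
    y: float


def glyphs_to_words(glyphs: List[Glyph], page: int, gap: float = 2.0) -> List[Word]:
    """Group positioned characters into words, keeping each word's location."""
    words: List[Word] = []
    current: List[Glyph] = []

    def flush() -> None:
        if current:
            text = "".join(g.char for g in current)
            words.append(Word(text, page, current[0].x, current[0].y))
            current.clear()

    for g in sorted(glyphs, key=lambda g: (g.y, g.x)):
        if g.char.isspace():
            flush()
            continue
        if current:
            last = current[-1]
            # A new baseline or a wide horizontal gap starts a new word.
            # Exact baseline equality is a simplification; real input would
            # likely need a tolerance here.
            if g.y != last.y or g.x - (last.x + last.width) > gap:
                flush()
        current.append(g)
    flush()
    return words


glyphs = [Glyph(c, 6.0 * i, 10.0, 6.0) for i, c in enumerate("NAME")]
glyphs += [Glyph(c, 30.0 + 6.0 * i, 10.0, 6.0) for i, c in enumerate("OF")]
words = glyphs_to_words(glyphs, page=1)
```

Each resulting word carries the page number and the position of its first character, which is the information later steps need to relate mutable text to immutable text and graphics.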

Following generation of a word list for the document, in step

106, the form recognition algorithm identifies the immutable text and graphic content in the form. This step may take a variety of possible forms depending

upon the specific form at issue and the recognition algorithm utilized. In some algorithms discussed below with reference to Figs. 5 and 6, the process of locating immutable text in the form, includes recognition of the form from among a plurality of candidates. However, in some embodiments of the present invention, the form in use is known in advance, and processing may

therefore be specific to the previously known form. This latter approach is a typical use of the rectangle recognition algorithm of Fig. 4. Accordingly, the process illustrated in Fig. 3 includes an optional step 108 of recognizing the form based upon its immutable content, which is performed only in those

cases where the form is not known in advance.

Following identification of the immutable text of the specific

form in use, in step 110, the mutable content of the form, i.e., the information filled in the blanks of the form, is identified, and its meaning is determined based upon its position relative to the immutable content. Various embodiments of the invention perform this step in different ways. For

example, the mutable text may be recognized by its proximity and positioning

relative to the immutable text that has been recognized in step 106.

Alternatively, or in addition, mutable text may be recognized from its position

relative to graphic content recognized in step 106. After identification of

mutable content, its meaning may be determined based upon its positioning in the form, or any other relationship it has with immutable content. In one

example, the meaning of the mutable content may be determined from the text

that it immediately follows, for example, the name and address of a borrower may follow the immutable text "D. NAME AND ADDRESS OF

BORROWER" as seen in box 34 of Fig. 2A. Alternatively, the position of mutable text relative to graphic elements may identify its meaning. For example, in area 42 illustrated in Fig. 2A, the meaning of mutable text such as

"175,000.000" is determined by the positioning of the box containing it relative to other boxes.

After the meaning of mutable text has been determined in step 112, that mutable text is associated with variable names reflecting the meaning

of the mutable text. Then in step 114, a data file, such as an XML, MISMO

SmartDoc, or other data file is stored containing the mutable text and the

variable names identifying the meanings of the mutable text, for later use as discussed above with reference to Fig. 1.
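The storing of step 114 can be pictured with a minimal sketch, assuming a generic XML layout. The element names `document` and `field`, the variable names, and the sample values are hypothetical illustrations and are not taken from the MISMO SmartDoc schema or from the figures.

```python
import xml.etree.ElementTree as ET


def store_fields(fields):
    """Serialize {variable_name: mutable_text} pairs as a small XML string.

    The <document>/<field> layout is an assumption made for illustration.
    """
    root = ET.Element("document")
    for name, value in fields.items():
        field = ET.SubElement(root, "field", name=name)
        field.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical variable names and made-up sample values.
xml_text = store_fields({
    "borrower_name": "Jane Doe",
    "contract_sales_price": "175,000.00",
})
```

A consumer of the stored file can later look up each value by its variable name rather than by its position on the printed form.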

Referring now to Fig. 4, one embodiment of the present

invention recognizes immutable content of the form by identifying rectangles

therein. In this embodiment, a rectangle recognition algorithm 120 is utilized to divide the surface of the form into rectangles, which are then used for recognition of text and the meaning of that text on the form. Prior to recognizing rectangles, the algorithm of Fig. 4 corrects the positioning of

vertical and horizontal graphic lines on the form to improve the identification

of rectangles on the form. The algorithm described herein utilizes the midline of horizontal and vertical lines to graphically define the position of those lines.

Thus, as seen in Fig. 4A, the algorithm will use the midline 124 of a horizontal line and the midline 126 of a vertical line. It will be appreciated, however, that

graphically defined lines include a finite width as seen at 125 and 127. The finite width of the graphically defined line adds a potential ambiguity and

source of error, specifically with respect to the exact position of the lines relative to each other.

As seen in Fig. 4A, midline 124 is defined to overlap with line 126, within the width 127 of the printed appearance of line 126. Lines 124

and 126 will thus have the printed appearance of a T-intersection; however,

midline 124 does not intersect at a T-intersection with midline 126. The overlap of midline 124 with midline 126 will not be visible on a printed

document because the graphical width of line 126 and the graphic width of

line 124 as illustrated at 125 and 127 will contain each other, so that the intersection between the two lines will appear to be a proper T-intersection

even though there is overlap.

A similar situation arises when the midline of a graphically

defined line does not intersect another line at a location that appears to be a

T-intersection. Specifically, as seen in Fig. 4A, midline 124' does not intersect midline 126' even though the printed width of line 124', as seen at 125', will overlap with the printed width of line 126', as seen at 127'.

A similar adjustment to that described above also needs to be

performed for vertical lines. Referring to Fig. 4B, it can be seen that a vertical line 132 having a printed width shown at 133 may not properly intersect with a horizontal line 134 having a printed width 135. Specifically, line 132 may extend beyond line 134 rather than forming a proper T-intersection, and again,

this inaccuracy may exist even though it is not visible on the printed page for the reason that the printed widths 135 and 133 of the respective lines mask the

excess overlap of line 132 with line 134. Similarly, as seen in Fig. 4B, line

132' does not properly intersect with line 134' shown in Fig. 4B, for the reason that line 132' stops short of line 134'. The improper intersection between lines

134' and line 132' may not be visible because the printed widths 133' and 135' mask it.

The lack of overlap or the excess of overlap, as illustrated in

Fig. 4A, must be corrected prior to identifying rectangles on a form, to

improve the accuracy of that recognition. Accordingly, returning to Fig. 4, in

step 122 a horizontal correction is performed at each line intersection.

In the process of step 122, each vertical line is extended by a

predetermined amount e/2 at each end. Then, each horizontal line is analyzed to determine whether its vertical position is within the vertical range of the

extended vertical line. If so, then if an end of the horizontal line is within e/2 of the horizontal position of the vertical line, then the position of that end of the horizontal line is set equal to the horizontal position of the vertical line.

Thus, if a horizontal line has an endpoint close to the horizontal position of a

vertical line, i.e., close to forming a T-intersection with a vertical line, but not exactly, the end point of the horizontal line is changed to equal the horizontal position of the vertical line, so that the end point of the horizontal line exactly

corresponds to the horizontal position of the vertical line. This involves

horizontally extending or shortening the end point of the horizontal line to

match the horizontal location of the vertical line. This compensation adjusts either of the inaccurate intersections illustrated in Fig. 4A to an accurate intersection.
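The horizontal correction of step 122 can be sketched as follows. This is a minimal illustration under stated assumptions: lines are represented by their midlines, the `HLine` and `VLine` types and the function name `snap_horizontal` are hypothetical, and "within e/2" is taken to be inclusive.

```python
from dataclasses import dataclass


@dataclass
class HLine:
    y: float   # vertical position of the midline
    x0: float  # left end
    x1: float  # right end


@dataclass
class VLine:
    x: float   # horizontal position of the midline
    y0: float  # top end
    y1: float  # bottom end


def snap_horizontal(hlines, vlines, e):
    """Snap horizontal line ends onto nearby vertical lines (in place)."""
    half = e / 2
    for v in vlines:
        # Extend the vertical line by e/2 at each end.
        top, bottom = v.y0 - half, v.y1 + half
        for h in hlines:
            if not (top <= h.y <= bottom):
                continue  # horizontal line is outside the extended range
            if abs(h.x0 - v.x) <= half:
                h.x0 = v.x  # left end lengthened or shortened
            if abs(h.x1 - v.x) <= half:
                h.x1 = v.x  # right end lengthened or shortened
    return hlines


vertical = [VLine(x=100.0, y0=0.0, y1=200.0)]
lines = [
    HLine(y=50.0, x0=10.0, x1=101.5),   # overshoots the vertical line
    HLine(y=60.0, x0=10.0, x1=98.8),    # stops short of it
    HLine(y=70.0, x0=10.0, x1=90.0),    # too far away to be snapped
    HLine(y=300.0, x0=10.0, x1=101.0),  # outside the vertical range
]
snap_horizontal(lines, vertical, e=4.0)
```

The first two lines end exactly at x = 100 after the call, while the last two are left unchanged.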

The rectangle recognition algorithm of Fig. 4 also includes step

130 for correcting vertical lines. Specifically, each vertical line is evaluated to determine whether it is close to forming a T-intersection with a horizontal line

at either of its ends. If so, the vertical line is extended or shortened so that the

vertical position of the end of the vertical line matches the vertical position of

the adjacent horizontal line.

In the process of step 130, each horizontal line is extended by a

predetermined amount e/2 at each end. Then, each vertical line is analyzed to determine whether its horizontal position is within the horizontal range of the

extended horizontal line. If so, then if the top of the vertical line is within e/2

of the vertical position of the horizontal line, then the top of the vertical line is set equal to the vertical position of the horizontal line, and, if the bottom of the

vertical line is within e/2 of the horizontal line, then the bottom of the vertical line is set equal to the vertical position of the horizontal line. Thus, if a vertical line has an endpoint close to the vertical position of a horizontal line,

i.e., close to forming a T-intersection with a horizontal line, but not exactly, the end point of the vertical line is changed to equal the vertical position of the horizontal line, so that the end point of the vertical line exactly corresponds to

the vertical position of the horizontal line. This involves vertically extending or shortening the end point of the vertical line to match the vertical location of

the horizontal line. This compensation adjusts either of the inaccurate intersections illustrated in Fig. 4B to an accurate intersection.

Following the adjustment steps described above, the algorithm

of Fig. 4 proceeds to break the form into rectangles to permit subsequent

recognition of text and its meaning, based upon the particular rectangle regions of the form. This process scans in a raster fashion through the form. The initial vertical position at the beginning of the process is set in step 142 to be

the top margin of the form. Next, in step 144 a data structure is created to

represent an "open" rectangle. Open rectangles are defined by a data structure identifying the left and right edges of the rectangle, and the vertical position at which the rectangle starts. The left, right, top and bottom printable area

margins (hereafter referenced as just the "margins") are treated as rectangle boundaries so that each page will contain at least one rectangle. Boxes within the page will create further rectangles, as discussed below.

Thus, to initiate the process in step 144, an open rectangle data

structure is created, describing an open rectangle extending from the left to the

right margin of the form, and starting at the current vertical position. Next, in

step 145 it is determined whether any non-checkbox vertical lines end (i.e., have their bottom end) at the current vertical position on the form. "Non-checkbox" vertical lines are vertical lines that are not part of a checkbox, i.e.,

any of the vertical lines in Fig. 2A except those that define the checkboxes in

box 32. (Checkboxes are rectangles or squares that have a predetermined maximum size, and are identified in this manner, and ignored during rectangle recognition.) If a non-checkbox vertical line ends at the current vertical

position, this indicates that currently open rectangles must be merged to reflect

the absence of the vertical line. Therefore, if there is a vertical line ending at

the current vertical position, then in step 146, the open rectangles that join at

the horizontal position of the ending vertical line are closed, by storing in the data structure representing the rectangles, a closing vertical position that

equals the current vertical position. Next, one rectangle is opened, by forming a data structure representing that rectangle, in step 147. The rectangle opened

in step 147 has its left and right edges at the horizontal positions of the left edge of the closed rectangle to the left of the ending vertical line and the right edge of the closed rectangle to the right of the ending vertical line, respectively. Thus, the ending of a vertical line will merge open rectangles by replacing them with one open rectangle beginning at

the vertical position of the end of the vertical line.

After thus creating a new rectangle at the ending position of the vertical line, processing returns to step 145 at which it is determined whether there is another vertical line ending at the current vertical position. If so, then

additional processing according to steps 146 and 147 is performed to merge

additional rectangles.

After all ending vertical lines have been handled by steps 145- 147, the next process is to identify in step 148, any non-checkbox vertical lines that start (i.e., have their top end) at the current vertical position on the form.

If a non-checkbox vertical line starts at the current vertical position, this indicates that currently open rectangles must be divided to reflect the presence of the vertical line. Therefore, if there is a vertical line starting at the current

vertical position, then in step 149, the open rectangle that includes the

horizontal position of the starting vertical line is closed, by storing in the data

structure representing the rectangle, a closing vertical position that equals the

current vertical position. Next, two rectangles are opened, by forming data structures representing those rectangles, in step 150. The two rectangles

opened in step 150 have their respective right and left edges, respectively, at the horizontal position of the vertical line, and have their left and right edges, respectively, at the left and right edges of the open rectangle that was closed in

step 149. Thus, the presence of a vertical line will split an open rectangle by

replacing that open rectangle with two open rectangles beginning at the vertical position of the top end of the vertical line.

After thus creating new rectangles at the position of the starting vertical line, processing returns to step 148 at which it is determined whether

there is another vertical line starting at the current vertical position. If so, then additional processing according to steps 149 and 150 is performed to create

additional rectangles, which may further divide open rectangles into a greater number of rectangles.

When all vertical lines ending and starting at the current vertical position have been processed through steps 145 and 148, then in step

152 the current vertical position is moved down to the next vertical coordinate.

If the bottom margin (end of page or EOP) has not yet been reached, then processing continues to step 154 in which it is determined whether there is a

non-checkbox horizontal line at the current vertical position. If there is a non-

checkbox horizontal line at the current vertical position, then in step 156, all

currently opened rectangles are closed, by marking the data structures representing those rectangles as terminating at the current vertical position.

Processing then continues to step 144 in which a new open rectangle is created

at the current vertical position, extending from the left to right margin.

Thereafter, processing continues to step 146 discussed above, in which any vertical lines at the current vertical position are assessed to potentially break the rectangle created in step 144 into smaller rectangles reflecting the presence

of vertical lines.

Returning to step 154, if it is determined that there is no horizontal line at the current vertical position, then processing proceeds

directly from step 154 to step 146 to evaluate whether there are vertical lines at the current vertical position that will require division of currently opened rectangles into smaller rectangles.

The algorithm described in Fig. 4 proceeds through the form from one vertical position to another until the bottom margin of the form is

reached in step 152. At this point, there are no additional vertical positions to evaluate in the form, and the process of Fig. 4 proceeds to step 160.

Fig. 4C provides exemplary illustrations of the manner in

which the process of Fig. 4 will divide various forms including boxes, into

rectangles. Form 158 illustrated in Fig. 4C includes a single box generally

centered on the form, and thus is a simple example. This form 158 will be divided by the process of Fig. 4 into five rectangles. A first rectangle 158-1 is defined in the area extending from the left, right and top margin to the horizontal line that defines the top of the box in the form 158. Three

rectangles 158-2, 158-3 and 158-4 are defined in the horizontal region where the box is positioned on the form, one to the left of, one to the right of, and one corresponding to the box. Finally, a fifth rectangle 158-5 is defined for the area extending from the horizontal line defining the bottom of the box, to

the left, right and bottom margins.
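The division of form 158 into five rectangles can be reproduced with a simplified sketch of the sweep. It is not a step-for-step transcription of Fig. 4: it visits only the vertical positions at which a line starts, ends, or lies, and after a horizontal line it reopens rectangles split by every vertical line still present below that position (an assumption made so that lines crossing a horizontal line are handled). The function name, tuple layouts, and coordinates are hypothetical, and checkbox lines are assumed to have been removed already.

```python
def find_rectangles(margins, hlines, vlines):
    """margins = (left, top, right, bottom); hlines = [(y, x0, x1)];
    vlines = [(x, y0, y1)] with y increasing down the page.
    Returns closed rectangles as (left, top, right, bottom) tuples."""
    left, top, right, bottom = margins
    h_ys = {y for y, _, _ in hlines if top < y < bottom}
    v_ys = {y for _, y0, y1 in vlines for y in (y0, y1) if top < y < bottom}
    closed = []
    open_rects = [(left, right, top)]  # (left edge, right edge, start y)

    def close(rect, y):
        l, r, t = rect
        if y > t:  # ignore zero-height rectangles
            closed.append((l, t, r, y))

    for y in sorted(h_ys | v_ys):
        if y in h_ys:
            # A horizontal line closes every open rectangle.
            for rect in open_rects:
                close(rect, y)
            # Reopen from margin to margin, split by vertical lines below y.
            xs = sorted({x for x, y0, y1 in vlines
                         if y0 <= y < y1 and left < x < right})
            edges = [left, *xs, right]
            open_rects = [(a, b, y) for a, b in zip(edges, edges[1:])]
            continue
        for x, _, y1 in vlines:  # ending vertical lines merge neighbours
            if y1 != y:
                continue
            lhs = next((r for r in open_rects if r[1] == x), None)
            rhs = next((r for r in open_rects if r[0] == x), None)
            if lhs is not None and rhs is not None:
                close(lhs, y)
                close(rhs, y)
                open_rects = [r for r in open_rects if r not in (lhs, rhs)]
                open_rects.append((lhs[0], rhs[1], y))
        for x, y0, _ in vlines:  # starting vertical lines split a rectangle
            if y0 != y:
                continue
            hit = next((r for r in open_rects if r[0] < x < r[1]), None)
            if hit is not None:
                close(hit, y)
                open_rects.remove(hit)
                open_rects += [(hit[0], x, y), (x, hit[1], y)]
    for rect in open_rects:  # bottom margin reached
        close(rect, bottom)
    return closed


# A single centered box, in the manner of form 158 (made-up coordinates).
rects = find_rectangles(
    margins=(0, 0, 100, 100),
    hlines=[(40, 30, 70), (60, 30, 70)],
    vlines=[(30, 40, 60), (70, 40, 60)],
)
```

For this input the sketch yields one rectangle above the box, three across the band containing the box, and one below it.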

Form 159 is a more elaborate form that includes 14 adjacent boxes of irregular sizes. This form will be divided by the process of Fig. 4

into 19 rectangles. Notably, as with the simpler form 158, a rectangle will be

defined at the top and bottom of the form, extending from the left to right margins, and rectangles will be defined to the left and right of the boxes,

divided by the horizontal lines. The boxes in the original form correspond to

the 14 rectangles 159-3, 159-4, 159-7, 159-8, 159-9, 159-13, 159-14, 159-15, 159-18, 159-19, 159-20, 159-23, 159-24 and 159-25.

Step 160 and the following steps recognize and associate text in the form with fields, on a rectangle by rectangle basis, thus capturing the data

presented by the form. Specifically, in step 162 a rectangle of the form is

evaluated to identify the meaning of the content in the rectangle, by evaluating

whether there is immutable content contained in the rectangle, i.e., text or a

graphic that is a known field identifier. For example, the rectangle may

include the known field name "D. NAME AND ADDRESS OF BORROWER:". If the rectangle is thereby identified, the rectangle may also contain mutable text which identifies the values for a field.

As noted above, the relationship between values and the

immutable text in a rectangle may not always be simple - a social security

number may be included in a rectangle having the immutable text "G. PROPERTY LOCATION" and a disbursement date may be included in a

rectangle having the immutable text "I. SETTLEMENT DATE". The process of step 160 and the following steps must be designed to flexibly determine the mutable text once the immutable text has been identified, by for example

recognizing a social security number as distinct from a name or address based upon its sequencing of digits and dashes.

In the case where there is mutable text in a rectangle,

processing continues to step 164 and then to step 166 in which any text within

the rectangle is extracted and assigned to the appropriate variable or variables,

as identified by the immutable text or graphic in the rectangle identified in step 162, and if necessary by the nature of the mutable text (e.g., formatted as a social security number) or by the positioning of mutable text in other

rectangles.
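Distinguishing mutable values by their format, as in the social security number example, might be sketched with regular expressions. The patterns, the category names, and the function `classify` are assumptions for illustration; a practical system would likely need more patterns and locale-specific handling.

```python
import re

# Hypothetical patterns: three digits, two digits, four digits with dashes,
# and a simple numeric date such as 10/21/2005.
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")
DATE_PATTERN = re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")


def classify(token):
    """Guess the kind of a mutable token from its character pattern."""
    if SSN_PATTERN.match(token):
        return "social_security_number"
    if DATE_PATTERN.match(token):
        return "date"
    return "text"


kinds = [classify(t) for t in ["123-45-6789", "10/21/2005", "Main"]]
```

A value classified this way can be assigned to a variable other than the one suggested by the immutable text of its rectangle.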

In some cases, a rectangle may not contain mutable text, but may include a reference to an adjacent rectangle containing mutable text. For example, the rectangle may contain the immutable text "101. Contract Sales Price", and be adjacent to another rectangle which contains the dollar figure

for the contract sales price. In this case processing continues through step 168 to step 170 in which the text from the appropriate adjacent rectangle, selected based upon the immutable text, is extracted and assigned to the appropriate variable, e.g. the variable that is identified by the immutable text in the current

rectangle.

Some rectangles may contain check boxes, and the processing of such rectangles in steps 164 and 166 is slightly different. In this case the

mutable text is analyzed to determine whether there is mutable text or graphic

(e.g., an "x" or check symbol) positioned inside one or more of the

checkboxes, and if so, then the immutable text adjacent to that checkbox is used to identify the meaning of the selected checkbox(es).

Some rectangles may not contain any text or may not contain mutable text which is associated with a value. For example, the rectangle may

be between boxes of the form and the margins. Or the rectangle may be part

of a box of the form, which only has the title of the form, e.g., "A. U.S.

DEPARTMENT OF HOUSING & URBAN DEVELOPMENT

SETTLEMENT STATEMENT". In one embodiment, such content may be used to recognize the form, but beyond this use, such rectangles may be ignored in the processing of steps 162 to 170 because they do not contain mutable text.

The processing of steps 162 through 170 is performed for each

rectangle identified on the form via the process of Fig. 4, after which all of the mutable text on the form that is associated with variable names, will be captured and stored for later delivery to clients, as discussed above with reference to Fig. 3.

The process described above with reference to Fig. 4 analyzes a

form by identifying rectangles on the form and thereafter identifying

immutable elements and mutable text in those rectangles, using the immutable text as a guide to the location and meaning of the mutable text. While this is

an efficient process for forms that include rectangles, an alternative process for

identifying mutable and immutable content of a form is required for form

documents such as those identified at 16 and 18 in Fig. 1, which do not include boxes that can be used for recognition.

One such alternative process, in accordance with principles of the present invention, is described in Fig. 5. The process of Fig. 5 extracts a

word list from an unknown received document, and compares it to the word

list from known forms, and uses the sequence of matching (immutable) and

nonmatching (mutable) words to determine which of the known forms is the closest match, and identify the meaning of the mutable and immutable text and data on the form.

In the process of Fig. 5, in a first step 200 a word list, to be identified as "TEST", is extracted from the document to be matched. In the

case of a PostScript or PDF formatted source document, this step typically involves using a library function that uses the character map in the document

to recognize graphical patterns that correspond to individual characters, and then recognize sequences of such characters as words at particular positions on the document. The words on the document are then converted to a word list

by a raster scan of the document, left to right and top to bottom.
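The raster-scan ordering can be sketched as below, assuming the extraction library has already produced words with page coordinates. The `(x, y, text)` tuple layout, the `line_tolerance` parameter, and the function name are hypothetical; the extraction library itself is not shown.

```python
def to_word_list(words, line_tolerance=2.0):
    """Order positioned words left to right, top to bottom.

    words: iterable of (x, y, text) with y increasing down the page.
    Words whose y values differ by no more than line_tolerance from the
    first word of a line are treated as lying on that line.
    """
    ordered = sorted(words, key=lambda w: (w[1], w[0]))
    lines, current, current_y = [], [], None
    for x, y, text in ordered:
        if current_y is None or abs(y - current_y) > line_tolerance:
            if current:
                lines.append(current)
            current, current_y = [], y  # start a new text line
        current.append((x, text))
    if current:
        lines.append(current)
    # Within each line, read from left to right.
    return [text for line in lines for _, text in sorted(line)]


word_list = to_word_list([
    (120, 10.5, "Tim"),
    (10, 10.0, "Name:"),
    (10, 30.0, "Ext.:"),
    (120, 30.4, "x123"),
])
```

The tolerance keeps words on the same printed line together even when their baselines differ slightly.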

It will be appreciated that principles of the present invention may be applied to word lists generated in any manner, and to documents in multiple formats. For example, an ASCII or other standard word processing

format document may be converted to a word list without requiring recognition of characters. Fax-formatted or other graphics formatted

documents may also be converted to word lists by optical character recognition (OCR) techniques and OCR library utilities known to those of skill in the art.

It will be appreciated, as discussed above with reference to Fig. 2B, that incoming documents may be multi-page, and those multiple pages

may constitute a single form or may include multiple forms in a single stream, requiring separation and individual matching of pages to known forms. In such circumstances, the wordlist (i.e., "TEST") extracted from the incoming document may be initially generated from only the first page of a multi-page print stream or document. If a word list extracted only from the first page

does not adequately match any known forms, then a word list extracted from the first two pages may be compared to known forms, and so on. Once a set of pages in an incoming document is successfully matched to a known form, then the remaining pages may be processed by the same procedure, starting with a

single page, then multiple pages.
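The page-by-page strategy can be sketched as a loop that grows the candidate page range until a match is found. The function `split_into_forms` and the `match` callback, which stands in for whichever matching process is used and returns a form name or `None`, are hypothetical.

```python
def split_into_forms(pages, match):
    """pages: list of per-page word lists.
    match(words): returns a form name, or None when no known form matches.
    Returns (form_name, first_page, end_page) tuples, end_page exclusive."""
    results, start = [], 0
    while start < len(pages):
        for end in range(start + 1, len(pages) + 1):
            # Try one page first, then two pages, and so on.
            words = [w for page in pages[start:end] for w in page]
            name = match(words)
            if name is not None:
                results.append((name, start, end))
                start = end  # continue with the remaining pages
                break
        else:
            # No page range starting here matched any known form.
            results.append((None, start, len(pages)))
            break
    return results


def demo_match(words):
    """Stub matcher used only for this illustration."""
    known = {("a1", "a2"): "FORM_A", ("b",): "FORM_B"}
    return known.get(tuple(words))


found = split_into_forms([["a1"], ["a2"], ["b"]], demo_match)
```

In the demonstration the first two pages are recognized together as one form and the third page as another.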

Following extraction of a wordlist from an incoming document,

or a page or pages thereof, this wordlist is compared to wordlists from each of several candidate forms that may be matchable to the incoming document.

These wordlists are generated in advance, for example by causing the source

client system to output a "blank" form, i.e., a form that contains all of the immutable content of a form, but does not contain any mutable content, and then converting the resulting "blank" form to a wordlist. Known wordlists

may also be generated manually by reviewing each form used by an organization and producing an optimal wordlist for matching to all known

versions of a form. Finally, known wordlists may be generated by causing a

client system to output wordlists for multiple versions of a given form (i.e.,

using different mutable content typical of different transaction types that use

the form), and then identifying the largest common subsequence of those wordlists by repeated application of a process such as that discussed below with reference to Fig. 6. This last approach to generating known wordlists

offers the potential for periodic automatic updating of known wordlists in

response to changes in forms at the client system. Specifically, the server 10 may retain several recent versions of a recognized form, e.g., as those forms

are forwarded through server 10, and periodically use the retained recent versions to rebuild a known wordlist representing the largest common subsequence of words in those forms, which can replace the known wordlist currently in use if any changes are noted.

The comparison of the incoming, unknown wordlist TEST to

known wordlists is performed one wordlist at a time. Specifically, in step 202,

a known wordlist "KNOWN" that has not previously been evaluated, is selected from the available pool of known wordlists. Next, in step 204, the

pointer variables x and y are initialized to zero and a temporary file is created for storing a matched version of the incoming wordlist TEST. Then, a loop of

steps 206-214 is entered, which collectively compare words in TEST to words

in KNOWN. In step 206, a current word in TEST (at position x) is compared to a current word in KNOWN (at position y). If these two current words

match, then in step 208 the pointers x and y are incremented (skipping to the

next words in TEST and KNOWN) and the current word in TEST is stored in the temporary file as a matched, apparently immutable word. If, however, the current words do not match, then in step 210 the pointer x is incremented (skipping to the next word in TEST), and the current word in TEST is stored in the temporary file as an unmatched, apparently mutable word. Thereafter, in steps 212 and 214 it is evaluated whether the end of the TEST or KNOWN wordlists have been reached, respectively, and if not, processing returns to step

206 to check for a match between the then-current words in TEST and

KNOWN.

When the end of either the TEST or KNOWN wordlists is reached, processing will proceed from either step 212 or 214, respectively, to step 216. At this step, the pointer y will identify the number of words in

KNOWN that were successfully matched to TEST. If the number of words y

matched to the KNOWN under evaluation is the largest number achieved so far, then in step 218 the temporary file storing the matched version of TEST, is stored as a current best match. In any case, in step 220, it is determined

whether further known wordlists are available for evaluation. If so, then in

step 222 the temporary file is erased for the next iteration, and processing

returns to step 202 to select another candidate known wordlist and evaluate its

match to the unknown wordlist TEST.

After all candidate known wordlists have been evaluated, processing continues from step 220 to step 224, in which the best match file

generated during the previous efforts is evaluated and compared to criteria that establish when a "match" is considered accomplished. These criteria may, for

example, require that the matched wordlist include a limited number of unmatched / apparently mutable words, or include matching for all or

substantially all of the words in the known wordlist. If these criteria indicate a success, then processing continues to step 226, and the mutable text is extracted from the best match file, and its meaning recognized based upon its proximity to immutable text, in a manner analogous to the processing

discussed above with reference to Fig. 4, steps 162-170.

In the event that the best matching known wordlist does not

meet the criteria of step 224, a matching failure is returned in step 228. This may end the matching process, or as discussed above may cause the matching

process to restart, using a TEST wordlist created from a series of the pages of

a received stream beyond those used in the initial evaluation.

An example demonstrating the results of the process of Fig. 5 follows. In this example, the wordlist to be tested includes the following:

TEST = Name: Tim Ext.: x123 Title: Programmer Spvsr.: Phil

Three known wordlists are matched to this test word list, as follows:

KNOWN #1 = Name: Ext.: Fax: Home: Title:

KNOWN #2 = Name: Ext.: Fax: Home: Title: Spvsr.:

KNOWN #3 = Name: Ext.: Title:

Matching proceeds according to Fig. 5, resulting in the following temporary files for each known word list (brackets identify words matched from the known word list):

Result #1 = [Name:] Tim [Ext.:] x123 Title: Programmer Spvsr.: Phil

Matches = 2

Result #2 = [Name:] Tim [Ext.:] x123 Title: Programmer Spvsr.: Phil

Matches = 2

Result #3 = [Name:] Tim [Ext.:] x123 [Title:] Programmer Spvsr.: Phil

Matches = 3
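The match counts in this example can be reproduced with a minimal sketch of the loop of steps 206-214. The function name `sequential_match` is hypothetical, a list of tagged words stands in for the temporary file, and the word lists are transcribed with consistent spelling, which is an assumption about the intended tokens.

```python
def sequential_match(test, known):
    """Walk TEST once, advancing in KNOWN only when the current words match.

    Returns (y, tagged): y is the number of KNOWN words matched, and tagged
    pairs each visited TEST word with True (matched, apparently immutable)
    or False (unmatched, apparently mutable)."""
    x = y = 0
    tagged = []  # stands in for the temporary file
    while x < len(test) and y < len(known):
        if test[x] == known[y]:
            tagged.append((test[x], True))
            x += 1
            y += 1
        else:
            tagged.append((test[x], False))
            x += 1
    return y, tagged


TEST = "Name: Tim Ext.: x123 Title: Programmer Spvsr.: Phil".split()
KNOWN = {
    1: "Name: Ext.: Fax: Home: Title:".split(),
    2: "Name: Ext.: Fax: Home: Title: Spvsr.:".split(),
    3: "Name: Ext.: Title:".split(),
}
scores = {k: sequential_match(TEST, v)[0] for k, v in KNOWN.items()}
best = max(scores, key=scores.get)
```

Because the position in KNOWN never advances past an unmatched word, a single word missing from TEST stalls all later matches for that known word list.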

It may be observed from this example that the matching process of Fig. 5 successfully identifies that the immutable word "Name:" in each known wordlist is matchable to the test word list, and furthermore that this

immutable word is followed in the test word list by a mutable word "Tim"

(which would be implied to be a name). Furthermore, after the word "Tim",

each known word list includes the immutable word "Ext.", which is matched by the process of Fig. 5 to the word "Ext.:" in the TEST word list. The TEST

word list includes the mutable word "x123" after "Ext." (which would be implied to be a telephone extension). Furthermore, it can be seen that the third

known word list includes the immutable word "Title", which is also matched to

the test word list. After "Title" the TEST word list includes "Programmer

Spvsr. Phil", which would be implied to be a title. Thus, the matching algorithm is capable of identifying mutable and immutable text and permitting

extraction of mutable text and identification of the meaning of that mutable text from proximity to immutable text.

However, it may also be observed that the process of Fig. 5 fails to identify the best match of the three known word lists to the TEST word list. Specifically, the best match of the three known word lists is KNOWN #2,

which not only includes immutable text for "Name:", "Ext." and "Title" (as is

the case with KNOWN #3), but also includes immutable text for "Spvsr.".

The process of Fig. 5 fails to match "Spvsr." in the TEST word list to KNOWN #2, for the reason that KNOWN #2 includes an immutable word

"Fax:" which is not included in the TEST wordlist. The omission of the

immutable word "Fax:" from the TEST wordlist prevents matching of the test

wordlist to KNOWN #2 beyond the immutable word "Ext." when using the process of Fig. 5, and thus prevents the best matching known wordlist from being identified.

This sort of difficulty, created by word deletions, is not

necessarily unusual in practical examples. While documents created from

forms typically include all of the immutable content of the "blank" form, plus additional mutable content, it is typical for forms to be updated from time to

time, and this updating may involve deleting immutable content from the

form. E.g., a field may be deleted or the immutable words in that field may be simplified by deletion of words. Furthermore, in some circumstances a "blank" form, rather than being blank, may include placeholder words at the

location of mutable content, to provide for easier identification of the locations where mutable content is inserted. The "blank" form could for example

include placeholder words (e.g., unique gibberish words such as zzyzz, zyzzz, etc.) at each location of mutable text. (This approach might help to prevent immutable words of a known document from being confused with mutable words of a received document - for example, the last name of the borrower identified on a mortgage transaction form might be "Borrower", a mutable

word that might be matched to an immutable word "Borrower" on a "blank"

form. Such is less likely if the form, instead of being "blank", includes

gibberish words that can be picked up by the matching algorithm.) If a "blank" form includes them, placeholder words would appear at the locations of

mutable text in a wordlist made from the "blank" form, and would not match

to a TEST wordlist extracted from a received document.

Thus, in some circumstances, it may be advantageous to provide an algorithm capable of identifying not only additions to forms, but also deletions.

One such process is the common subsequence matching

process illustrated in Fig. 6. This process is used in a manner similar to the

process of Fig. 5, but performs a more sophisticated analysis of a TEST and KNOWN wordlist to uniquely identify the longest subsequence of common

words in the two wordlists. This is accomplished, as in Fig. 5, by first extracting the TEST wordlist from the received, unknown document (step 250), and then selecting a KNOWN wordlist from the available pool (step 252).

The process of Fig. 6 uses two tables to store located

subsequence information; each table has a number of rows equal to the number of words in TEST (which is stored as "m" in step 254), and a number of columns equal to the number of words in KNOWN (which is stored as "n" in

step 254). The first table, known as bTABLE (which is dimensioned in step

256), provides data indicating the words in the longest common subsequence,

and the second table, known as cTABLE (also dimensioned in step 256), tracks the length of the longest subsequence that has been identified at a

particular time during execution of the algorithm. The content of cTABLE

and bTABLE, taken together, thus fully characterize the location and length of

the subsequences found in TEST and KNOWN.

In step 258, the cTABLE is initialized by setting the values in its 0th row and 0th column to 0 (representing that no common subsequences are

known at the start of processing), and in step 260, the loop variables i and j are

initialized to values of 1, so that processing of KNOWN and TEST begins at the first word of each.

Processing in Fig. 6 proceeds through a double loop of steps

262 through 276, in which each word in TEST is compared sequentially to the word in the corresponding and later positions in KNOWN. By thus

proceeding systematically through a comparison of KNOWN and TEST, any common subsequence of KNOWN and TEST is reliably identified and captured by the information in cTABLE and bTABLE.
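The two-table procedure corresponds to the standard dynamic-programming computation of a longest common subsequence, and can be sketched as follows. The function names are hypothetical, and while the symbol "\" for matching words follows the description, the symbols "|" and "-" used here for the non-matching cases are assumptions chosen for illustration. Each table carries an extra 0th row and 0th column.

```python
def lcs_tables(test, known):
    """Fill cTABLE (lengths) and bTABLE (direction symbols)."""
    m, n = len(test), len(known)
    c = [[0] * (n + 1) for _ in range(m + 1)]  # 0th row and column stay 0
    b = [[""] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if test[i - 1] == known[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
                b[i][j] = "\\"  # words i and j match
            elif c[i - 1][j] >= c[i][j - 1]:
                c[i][j] = c[i - 1][j]
                b[i][j] = "|"   # assumed symbol: carry from (i-1, j)
            else:
                c[i][j] = c[i][j - 1]
                b[i][j] = "-"   # assumed symbol: carry from (i, j-1)
    return c, b


def common_words(b, test):
    """Trace bTABLE back from its last entry to list the common words."""
    i, j = len(b) - 1, len(b[0]) - 1
    out = []
    while i > 0 and j > 0:
        if b[i][j] == "\\":
            out.append(test[i - 1])
            i, j = i - 1, j - 1
        elif b[i][j] == "|":
            i -= 1
        else:
            j -= 1
    return out[::-1]


TEST = "Name: Tim Ext.: x123 Title: Programmer Spvsr.: Phil".split()
KNOWN_2 = "Name: Ext.: Fax: Home: Title: Spvsr.:".split()
c_table, b_table = lcs_tables(TEST, KNOWN_2)
length = c_table[len(TEST)][len(KNOWN_2)]
shared = common_words(b_table, TEST)
```

With the example word lists transcribed using consistent spelling (an assumption), this sketch finds four common words for the second known word list, so a word absent from TEST no longer blocks the matches that follow it.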

In a first step 262, it is determined whether the current word in

TEST (at position i) is the same as the current word in KNOWN (at position j). If so, then any common subsequence that has previously been identified in

the parts of TEST and KNOWN that precede words i and j, respectively, can be extended by one word, to include the matching words i and j. Accordingly,

in step 264, the value in cTABLE at (i-1,j-1), which represents the length of

the longest subsequence found up to words i-1 and j-1 in TEST and KNOWN, is incremented by 1, and the result is stored in cTABLE at entry (i,j). Entry (i,j) in cTABLE thus reflects the length of the longest subsequence found, up to words i and j of TEST and KNOWN. Also in step 264, the bTABLE entry

at (i,j) is updated to store the value "\", a symbol that represents that words (i,j)

of TEST and KNOWN are matching words (the manner in which the symbols

of bTABLE interplay to identify subsequences will be appreciated from the

examples provided below).

In the event that word i of TEST and word j of KNOWN do not match in step 262, processing continues to step 266, in which a test is

performed to determine the longest previously-identified common subsequence of TEST and KNOWN. Specifically, in step 266, the value of

cTABLE at (i-1,j) is compared to the value of cTABLE at (i,j-1). This test determines whether a longer subsequence was found in the immediately

preceding comparison of word i-1 of TEST to word j of KNOWN, or in the preceding (and earlier) comparison of word i of TEST to word j-1 of

KNOWN. In the event that the cTABLE entry at (i-1,j) is greater than or equal to the cTABLE entry at (i,j-1), this indicates that the longest subsequence

was found in the preceding comparison of word i-1 of TEST to word j of

KNOWN, and in this case in step 268 the value of cTABLE at entry (i-1,j) is stored in cTABLE at the entry (i,j), thus reflecting that the longest

subsequence has the same length after word i of TEST as before word i of TEST (because word i of TEST did not match word j of KNOWN). Also in

step 268, the symbol "^" is placed in bTABLE at entry (i,j), reflecting that the longest known subsequence at word i of TEST and word j of KNOWN

continues from word i-1 of TEST and word j of KNOWN.

If the comparison of step 266 is false, this indicates that the

longest subsequence was found in the preceding comparison of word i of

TEST to word j-1 of KNOWN, and in this case in step 270 the value of cTABLE at entry (i,j-1) is stored in cTABLE at entry (i,j), thus reflecting that the

longest subsequence has the same length after word j of KNOWN as before word j of KNOWN (because word i of TEST did not match word j of KNOWN). Also in step 270, the symbol "<" is placed in bTABLE at entry (i,j), reflecting that the longest known subsequence at word i of TEST and

word j of KNOWN continues from word i of TEST and word j-1 of KNOWN.

After step 264, 268 or 270, in step 272 it is determined whether the end of the TEST wordlist has been reached. If not, then the pointer i is incremented in step 274, to begin comparison of the next word of TEST with the current word of KNOWN in the above-described manner, and processing

returns to step 262.

When the end of the TEST wordlist is reached, processing continues to step 276, at which it is determined whether the end of the KNOWN wordlist has been reached. If not, then the pointer j is incremented in step 278, and the pointer i is reset to 1, to begin comparison of the first word

of TEST with the next word of KNOWN in the above-described manner, and processing returns to step 262.
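The double loop of steps 254 through 278 may be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the described embodiment: the function name `fill_tables`, the use of nested lists for the tables, and the short sample wordlists are all hypothetical. The tables are given one extra row and column to hold the zero boundary set in step 258.

```python
def fill_tables(test, known):
    """Fill cTABLE and bTABLE for two wordlists (steps 254-278 of Fig. 6)."""
    m, n = len(test), len(known)                    # step 254
    # Step 256: dimension the tables; row 0 and column 0 stay 0 (step 258).
    c = [[0] * (n + 1) for _ in range(m + 1)]
    b = [[""] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):                       # outer loop over KNOWN (steps 276, 278)
        for i in range(1, m + 1):                   # inner loop over TEST (steps 272, 274)
            if test[i - 1] == known[j - 1]:         # step 262: words match
                c[i][j] = c[i - 1][j - 1] + 1       # step 264: extend subsequence
                b[i][j] = "\\"
            elif c[i - 1][j] >= c[i][j - 1]:        # step 266
                c[i][j] = c[i - 1][j]               # step 268: carry from word i-1 of TEST
                b[i][j] = "^"
            else:
                c[i][j] = c[i][j - 1]               # step 270: carry from word j-1 of KNOWN
                b[i][j] = "<"
    return c, b


# Hypothetical sample wordlists, chosen only to exercise the loop.
test = "Name: Tim Ext.: x123 Title:".split()
known = "Name: Ext.: Fax: Title:".split()
c_table, b_table = fill_tables(test, known)
print(c_table[len(test)][len(known)])  # prints 3
```

Because each column j is completed before column j+1 begins, every entry read in steps 264 through 270 has already been written when it is needed.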

When the end of the KNOWN wordlist is reached, the double loop of steps described above is complete, and processing continues from step

276 to step 280. In step 280 the cTABLE and bTABLE are evaluated to determine the quality of the match between TEST and the currently selected KNOWN wordlist. Specifically, the largest value in the cTABLE represents

the length of the longest matched subsequence of TEST and KNOWN, and is a good indicator of the quality of match between TEST and the current KNOWN. If in step 280 the largest value in the cTABLE is larger than the

current best match, then in step 282 the cTABLE and bTABLE are stored as the current best match. After these steps, in step 284 it is determined whether there are further known wordlists to compare to the TEST wordlist, and if so,

processing returns to step 252 to select a remaining known wordlist for

comparison. After all candidate known wordlists have been evaluated,

processing continues from step 284 to step 286, in which the best match is evaluated against criteria for a successful match. These criteria may, for

example, require that the matched wordlist include a limited number of unmatched / apparently mutable words, or include matching for all or substantially all of the words in the known wordlist. If these criteria indicate a

success, then processing continues to step 288, and the mutable text is

extracted from the best match file, and its meaning recognized based upon its

proximity to immutable text, in a manner analogous to the processing

discussed above with reference to Fig. 4, steps 162-170.

In the event that the best matching known wordlist does not meet the criteria of step 286, a matching failure is returned in step 290. This may end the matching process, or as discussed above may cause the matching

process to restart, using a TEST wordlist created from a series of the pages of a received stream beyond those used in the initial evaluation.
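The selection of a best match across the pool of known wordlists (steps 280 through 290) may be sketched as below. The names `lcs_length`, `best_match` and `min_fraction` are hypothetical, and the acceptance test shown is only one possible reading of the step 286 criteria: it requires that a given fraction of the words in the known wordlist be matched, with the threshold value left to the caller as an assumption.

```python
def lcs_length(test, known):
    """Length of the longest common subsequence of two wordlists."""
    m, n = len(test), len(known)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            if test[i - 1] == known[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c[m][n]  # the largest value in the table


def best_match(test, known_lists, min_fraction):
    """Return (name, length) of the best known wordlist, or None on failure."""
    best_name, best_len = None, 0
    for name, known in known_lists.items():     # steps 252 and 284
        length = lcs_length(test, known)
        if length > best_len:                   # step 280
            best_name, best_len = name, length  # step 282
    if best_name is None:
        return None                             # nothing matched at all
    # Step 286 (assumed criterion): fraction of known words that were matched.
    if best_len / len(known_lists[best_name]) < min_fraction:
        return None                             # step 290: matching failure
    return best_name, best_len


test = "Name: Tim Ext.: x123 Title: Programmer Spvsr.: Phil".split()
known_lists = {
    "KNOWN #1": "Name: Ext.: Fax: Home: Title:".split(),
    "KNOWN #2": "Name: Ext.: Fax: Home: Title: Spvsr.:".split(),
    "KNOWN #3": "Name: Ext.: Title:".split(),
}
print(best_match(test, known_lists, min_fraction=0.5))  # ('KNOWN #2', 4)
print(best_match(test, known_lists, min_fraction=0.9))  # None
```

With the stricter threshold the best candidate is rejected because only four of its six words are matched, which illustrates why the criteria of step 286 are applied separately from the ranking of step 280.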

An example demonstrating the results of the process of Fig. 6 follows. In this example, the wordlist to be tested includes the following:

TEST = Name: Tim Ext.: x123 Title: Programmer Spvsr.: Phil

Three known wordlists are matched to this test word list, as follows:

KNOWN #1 = Name: Ext.: Fax: Home: Title:

KNOWN #2 = Name: Ext.: Fax: Home: Title: Spvsr.:

KNOWN #3 = Name: Ext.: Title:

Matching proceeds according to Fig. 6, resulting in the following cTABLE and bTABLE results, indicating the identified subsequences in each known word list (underlining identifies words matched from the known word list):

KNOWN #1 = Name: Ext.: Fax: Home: Title:

TEST = Name: Tim Ext.: x123 Title: Programmer Spvsr.: Phil

Result #1 = Name: Tim Ext.: x123 Title: Programmer Spvsr.: Phil Matches = 3

KNOWN #2 = Name: Ext.: Fax: Home: Title: Spvsr.:

TEST = Name: Tim Ext.: x123 Title: Programmer Spvsr.: Phil

Result #2 = Name: Tim Ext.: x123 Title: Programmer Spvsr.: Phil Matches = 4

KNOWN #3 = Name: Ext.: Title:

TEST = Name: Tim Ext.: x123 Title: Programmer Spvsr.: Phil

Result #3 = Name: Tim Ext.: x123 Title: Programmer Spvsr.: Phil Matches = 3

It may be observed from this example that the matching process

of Fig. 6 successfully identifies more matches in known wordlists than the process of Fig. 5. Specifically, the matching process of Fig. 6 not only

determines that the immutable words "Name:", "Ext.:" and "Title:" of the third word list are matched to the test word list, but furthermore, the matching process of Fig. 6 determines that KNOWN #2 is the best match of the three known word lists to the TEST word list by matching not only "Name:", "Ext.:" and "Title:", but also matching "Spvsr.:". The process of Fig. 6 thus matches "Spvsr.:" in the TEST word list to KNOWN #2, even though KNOWN #2

includes the word "Fax:" which is not included in the TEST wordlist. The omission of the immutable word "Fax:" from the TEST wordlist, or other deletions, does not prevent matching of the test wordlist to words beyond the deleted words in the known wordlist, and thus permits the best matching

known wordlist to be identified.
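The way the bTABLE symbols identify the matched words in this example may be sketched as follows, using TEST and KNOWN #2. The function names are hypothetical. The rule in `pair_fields`, which assigns each run of unmatched words to the matched label immediately preceding it, is an assumption made for illustration and is a simplified stand-in for the position-based recognition described with reference to Fig. 4.

```python
def fill_tables(test, known):
    """Build cTABLE and bTABLE as in steps 254-278 of Fig. 6."""
    m, n = len(test), len(known)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    b = [[""] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            if test[i - 1] == known[j - 1]:
                c[i][j], b[i][j] = c[i - 1][j - 1] + 1, "\\"
            elif c[i - 1][j] >= c[i][j - 1]:
                c[i][j], b[i][j] = c[i - 1][j], "^"
            else:
                c[i][j], b[i][j] = c[i][j - 1], "<"
    return c, b


def matched_positions(b, m, n):
    """Walk bTABLE back from (m, n); return 0-based TEST indices of matches."""
    i, j, hits = m, n, []
    while i > 0 and j > 0:
        if b[i][j] == "\\":      # words i and j match: record and move diagonally
            hits.append(i - 1)
            i, j = i - 1, j - 1
        elif b[i][j] == "^":     # subsequence continues from word i-1 of TEST
            i -= 1
        else:                    # "<": continues from word j-1 of KNOWN
            j -= 1
    return hits[::-1]


def pair_fields(test, hits):
    """Assumed rule: each matched label owns the unmatched words after it."""
    fields, label, hit_set = {}, None, set(hits)
    for idx, word in enumerate(test):
        if idx in hit_set:
            label = word
            fields[label] = []
        elif label is not None:
            fields[label].append(word)
    return fields


test = "Name: Tim Ext.: x123 Title: Programmer Spvsr.: Phil".split()
known2 = "Name: Ext.: Fax: Home: Title: Spvsr.:".split()
c_table, b_table = fill_tables(test, known2)
hits = matched_positions(b_table, len(test), len(known2))
matched = [test[i] for i in hits]
fields = pair_fields(test, hits)
print(matched)  # ['Name:', 'Ext.:', 'Title:', 'Spvsr.:']
print(fields)
```

The walk skips over "Fax:" and "Home:" by following "<" entries, which is how words deleted from the form are tolerated without losing the later match on "Spvsr.:".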

While the present invention has been illustrated by a

description of various embodiments and while these embodiments have been described in considerable detail, it is not the intention of the applicants to restrict or in any way limit the scope of the appended claims to such detail.

Additional advantages and modifications will readily appear to those skilled in the art. The invention in its broader aspects is therefore not limited to the

specific details, representative apparatus and method, and illustrative example shown and described. Accordingly, departures may be made from such details

without departing from the spirit or scope of applicant's general inventive concept.

What is claimed is:

Claims

1. A method of recognizing the content of a document as part of

electronic delivery thereof, comprising generating a word list for said document, recognizing immutable content of said document that corresponds to a form used in generating the document,

identifying meanings for mutable content included in said word list based upon positions of words in said word list relative to recognized immutable content, storing said mutable content in association with identified

meanings thereof for subsequent retrieval.

2. The method of claim 1 wherein said immutable content comprises words in said word list.

3. The method of claim 1 wherein said immutable content

comprises graphical content in said document.

4. The method of claim 1 wherein said document is graphically described.

5. The method of claim 4 wherein generating said word list comprises one or more of: matching a character map embedded within said

document to graphical content of said document and performing character recognition upon said graphical content.

6. The method of claim 1 wherein said immutable content is identified by comparison of said word list to a word list of a known document.

7. The method of claim 6 wherein said document is recognized as the product of a form upon recognition of immutable content therein as

matching a known document descriptive of the form.

8. The method of claim 7 wherein the known document

descriptive of the form is generated by identification of common word

subsequences in a plurality of documents generated from the form.

9. The method of claim 6 wherein said immutable content is

identified by identification of common word subsequences in said word list

and a word list of a known document.

10. The method of claim 6 wherein said immutable content is identified by comparison of said word list to word lists of a plurality of known documents.

11. The method of claim 1 wherein said mutable content is

stored in one or more of an XML or MISMO SmartDoc format.

12. The method of claim 1 wherein said immutable content is

identified by recognizing graphical boxes included in said document.

13. The method of claim 12 wherein said graphical boxes are

recognized by dividing said document into rectangles consistently with the positions of graphic lines in said document.

14. The method of claim 13 wherein prior to recognition of said

graphical boxes, intersections between horizontal and vertical lines in said document are corrected to create T intersections between horizontal and vertical lines when a horizontal or vertical line ends near to but not at a

vertical or horizontal line, respectively.