WO1999008226A1 - Retrospective conversion - Google Patents
- Publication number
- WO1999008226A1 (PCT/NL1998/000452)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- database
- document
- word
- keywords
- words
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/26—Techniques for post-processing, e.g. correcting the recognition result
- G06V30/262—Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Definitions
- The present invention relates to a method and device for converting information arranged on paper to an electronically structured data file or database. This conversion is also referred to as retrospective conversion.
- Paper data files have been built up in the past for different purposes, for instance library catalogues, birth registers, criminal records and the like, wherein large quantities of information are recorded on paper.
- The information is often arranged in handwritten or typed form, or in the form of stamps and the like.
- The object of the present invention is therefore to provide a method and device with which written or typed information can be stored, correctly structured, in an electronic database.
- The invention therefore relates to a method for linking written or typed information in a document to database information from a database, comprising:
- The present invention also comprises a device for linking written or typed information in a document to database information from a database, comprising:
- - keyword selection means for selecting keywords from the converted document;
- - keyword list means for drawing up a key part-word list consisting in each case of a number of successive characters of the keywords;
- - database word list means for drawing up a database word list of the database words occurring in the database;
- - database part-word means for drawing up a database part-word list consisting in each case of the above stated number of successive characters of words from the database word list, wherein each database part-word contains a reference to the database word of which it forms part;
- - comparing means for comparing the key part-words from the document with the database part-word list;
- - database word selection means for selecting, on the basis of the comparison, the database words corresponding with the keywords.
- FIG. 1 is a block diagram of a device for converting written or typed information into a structured electronic data file
- FIG. 2 is a block diagram of the method for converting written or typed information into a structured text, without the text having been corrected for errors
- FIG. 3 is an example of a library card on which retrospective conversion must be performed
- FIG. 4 is an example of the same library card after performing segmentation of the fore- and background
- FIGS. 5a and 5b show another example of performing segmentation of the fore- and background
- FIG. 6 is an example of the same library card after performing image processing steps a to j
- FIG. 7 is a block diagram showing the method for indexing an external database
- FIG. 8 is a block diagram which shows schematically the method for selecting keywords
- FIG. 9 is a block diagram showing the method of direct linking of documents to records of a comparison database
- FIG. 10 is an example of the comparison of trigrams, i.e. comparison of combinations of three successive characters
- FIG. 11 is a block diagram showing the method of indirect linking of documents to records of a comparison database.
- The preferred embodiment of the present invention relates to the conversion of data on library cards to an electronic database.
- Manual conversion to a database is too costly and too time-consuming.
- Automatic conversion, on the other hand, is very difficult in view of the wide variety in structure: information such as title, author, location and the like does not occupy a fixed, unambiguous position on the library card.
- The variety in handwriting, the variety in typefaces (fonts) used and possible soiling of the card, for instance by coffee stains, can also hamper automatic conversion.
- Figure 1 shows a preferred embodiment of a device with which the conversion can be effected.
- The device comprises inter alia a scanner 1 with which the library cards can be read in, a connection 2 to an external database 5, a storage medium 3 for storing data and a computer 4 for controlling the data conversion.
- Figure 2 shows a block diagram representing the method for converting written or typed information into a form readable for the computer. Linking to an external database has not yet taken place herein.
- A library card is characterized by the following three characteristics: (i) a title description, which contains information concerning the title, author and edition of the relevant book and which relates only to the document to which it refers, not to the specific location where the document can be found; (ii) a signature, which contains information concerning the specific location in the library where a copy of the document can be found; and (iii) a logical structure, which indicates that the title description of the library card is divided into segments designating the different types of bibliographical information.
- The documents or library cards are converted into digital images using a scanner. These digital images can be binary, i.e. black-and-white values only, or can contain colour and/or grey tones. In order to carry out effective image processing at a later stage, digital images with grey tones and/or colour tones are preferably generated. This is however not essential for the method of the present invention.
- b. Foreground-background segmentation
- The quality of the digital image is subsequently enhanced by means of image enhancement techniques, such as background-foreground segmentation, wherein the text (foreground) on a card is separated from other information (background), such as for instance coffee stains, colours and patterns in the background and the like.
- Figure 4 shows the result of the image enhancement by foreground-background segmentation. Background information in the form of edges of the stamp of the signature is herein removed from the digital image of the library card. Shown in figures 5a and 5b is another example wherein the background pattern of the image is removed from the original image by segmentation.
- The digital image is then subjected to automatic extraction and marking of logical components in the library card.
- Logical components are parts of the image regarded as a meaningful unit or entity by a normal user.
- Logical components in a scientific book are for instance notes, chapter titles, paragraph titles, footnotes and the like.
- The purpose hereof is to code each library card, hereafter also referred to as document, in accordance with a given document type definition (DTD) as used in SGML coding, which definition describes the logical structure of the document, in this case the library card itself.
- A macro object is built up of various micro objects or macro objects; for instance the macro object word is built up of various characters, the macro object line is built up of various words, etc.
- A series of image processing steps is performed to extract the connected components and their characteristics, followed by marking steps in which components are grouped into macro objects and are marked as SGML elements, which will be described hereinbelow.
- A micro object can be seen as a single spot of connected pixels surrounded by "open" space (white space).
- An example of a micro object is a single letter "e".
- The letter "i" is built up of two objects, i.e. the dot and the rest of the letter. Each object is described by a frame formed by the left and top coordinate of the object and the width and height of the object.
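The extraction of micro objects and their frames can be sketched as follows. This is a minimal illustration only, assuming a binary image given as a list of rows (1 = black, 0 = white) and assuming 8-connectivity; the function name `extract_micro_objects` is hypothetical, and the source does not state which connectivity rule or data layout is used.

```python
from collections import deque


def extract_micro_objects(image):
    """Return one frame (left, top, width, height) per connected black spot.

    Assumption: `image` is a list of equally long rows, 1 = black, 0 = white,
    and pixels touching diagonally count as connected (8-connectivity).
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    frames = []
    for y in range(rows):
        for x in range(cols):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Flood-fill one spot while tracking its extent.
            queue = deque([(x, y)])
            seen[y][x] = True
            min_x = max_x = x
            min_y = max_y = y
            while queue:
                cx, cy = queue.popleft()
                min_x, max_x = min(min_x, cx), max(max_x, cx)
                min_y, max_y = min(min_y, cy), max(max_y, cy)
                for dx in (-1, 0, 1):
                    for dy in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        inside = 0 <= nx < cols and 0 <= ny < rows
                        if inside and image[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            frames.append((min_x, min_y, max_x - min_x + 1, max_y - min_y + 1))
    return frames


# A tiny letter "i": the dot and the stem are two separate micro objects.
LETTER_I = [
    [0, 1, 0],
    [0, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
]
```

Calling `extract_micro_objects(LETTER_I)` yields two frames, one for the dot and one for the stem, which mirrors the remark about the letter "i".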
- The result of micro object extraction is a list of micro objects.
- d. Histogram analysis
- Domain-independent knowledge is for instance the fact that it is known a priori that words and sentences in a particular language are written from left to right. It is therefore important to extract the relevant information from the document itself. Histogram analysis can be used for this purpose.
- All objects determined during the micro object extraction are used in making histograms for the coverage, which is defined as the percentage of black pixels within the frame of an object, the height, which is defined by the height of the frame of an object, and the surface area, defined by the surface area of the frame of an object. Histograms are also made of the entire document (thus, all objects together).
- e. Marking of micro objects
- This process is based on algorithms each making use of the statistical data originating from the histogram analysis.
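The statistical data referred to here could be gathered along the following lines. This is a sketch under stated assumptions: each micro object is given as a tuple (left, top, width, height, black_pixels), coverage is bucketed into bins of 10 percentage points, and the name `object_histograms` is hypothetical. The source defines coverage, height and surface area but does not prescribe any bin width.

```python
from collections import Counter


def object_histograms(objects, coverage_bin=10):
    """Build histograms of coverage, height and frame area.

    Assumption: `objects` holds (left, top, width, height, black_pixels)
    tuples with width and height of at least 1.
    """
    coverage_hist, height_hist, area_hist = Counter(), Counter(), Counter()
    top_bin = 100 // coverage_bin - 1
    for _left, _top, width, height, black_pixels in objects:
        area = width * height
        # Coverage: percentage of black pixels within the frame.
        coverage = 100.0 * black_pixels / area
        # A coverage of exactly 100% is counted in the highest bin.
        bin_index = min(int(coverage // coverage_bin), top_bin)
        coverage_hist[bin_index * coverage_bin] += 1
        height_hist[height] += 1
        area_hist[area] += 1
    return coverage_hist, height_hist, area_hist
```

The peaks of `height_hist` give, for example, a typical character height for the card, which later steps can use as a reference value.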
- A distinction is herein made between text objects and other objects, such as for instance photo, table or graphic objects.
- The detection of these objects is performed with a decision tree wherein use is made of various parameters. For each object type it is described which parameters are important and how these can be used to separate the different objects.
- The parameters used are for instance: left-hand position of the object, top position of the object, width of the object, height of the object, width/height ratio of the object, area of the object, number of object pixels, and coverage (number of object pixels within the frame divided by the total number of pixels within the frame).
- Photographs can for instance be detected because the surface area of the object is large, the coverage is high and the width/height ratio is about 0.05 to 5.
- Text can for instance be detected due to the small size and width.
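A decision tree of this kind might look as follows. The sketch is illustrative only: the thresholds `photo_min_area`, `photo_min_coverage` and the factor 2 on the average height are assumptions chosen for the example, as is the function name `mark_micro_object`. Only the width/height range of 0.05 to 5 for photographs is taken from the text.

```python
def mark_micro_object(width, height, black_pixels, avg_height,
                      photo_min_area=10_000, photo_min_coverage=0.5):
    """Label one micro object as "photo", "text" or "other".

    Assumption: all thresholds are placeholders that would be tuned on
    real cards, for instance from the histogram statistics.
    """
    area = width * height
    coverage = black_pixels / area      # fraction of black pixels in the frame
    ratio = width / height
    # Photographs: large area, high coverage, moderate width/height ratio.
    if area >= photo_min_area and coverage >= photo_min_coverage and 0.05 <= ratio <= 5:
        return "photo"
    # Text: small objects, judged by height relative to the card's average,
    # because width is unreliable when letters are joined after scanning.
    if height <= 2 * avg_height:
        return "text"
    return "other"
```

Height rather than width is used in the text branch, since joined letters inflate the width of an object without changing its height.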
- The heights are compared with the average height of all objects on the library card. Width is not a very reliable parameter because letters may still be joined together after having been scanned.
- f. Straightening documents
- Micro objects such as letters are grouped into macro objects, such as paragraphs.
- Knowledge about the logical structure of the library card will be used together with the results of the histogram analyses.
- Documents are usually written in horizontal and vertical direction. By projecting all frames onto a horizontal and a vertical axis, certain areas running horizontally or vertically through the document can be found in which no micro objects are situated (so-called "white rivers"). If these areas are wide or high enough, a title or column can for instance be detected.
- h. Marking of macro objects
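The search for "white rivers" by projecting frames onto an axis, as described for the grouping of micro objects, can be sketched as follows. This is a minimal illustration, assuming frames are (left, top, width, height) tuples in pixel units; the function name `white_rivers` and the parameter `min_gap` are hypothetical.

```python
def white_rivers(frames, axis_length, min_gap, vertical_axis=True):
    """Return (start, end) intervals on one axis that no frame covers.

    With vertical_axis=True the frames are projected onto the vertical axis,
    so the gaps found are horizontal rivers (for example between a title and
    the body). With vertical_axis=False the gaps separate columns.
    `end` is exclusive; gaps shorter than `min_gap` are ignored.
    """
    covered = [False] * axis_length
    for left, top, width, height in frames:
        start, size = (top, height) if vertical_axis else (left, width)
        for pos in range(max(start, 0), min(start + size, axis_length)):
            covered[pos] = True
    rivers = []
    gap_start = None
    # The trailing True acts as a sentinel that closes a gap at the edge.
    for pos, is_covered in enumerate(covered + [True]):
        if not is_covered and gap_start is None:
            gap_start = pos
        elif is_covered and gap_start is not None:
            if pos - gap_start >= min_gap:
                rivers.append((gap_start, pos))
            gap_start = None
    return rivers
```

The small gap between two adjacent text lines falls below `min_gap` and is ignored, whereas the larger gap before a separate block is reported.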
- Optical character recognition is the process of reading a collection of pixels and converting it into letters.
- The optical character recognition technique also provides knowledge concerning the typeface (font), the character size and the style, which information can be used again at a later stage.
- The method for comparing the results of the steps of the LSD analysis to the external database depends on the possibility of performing standard database operations on the external database. If this is not possible, no direct comparison can be made. The comparison can however take place in indirect manner. On the basis of a list of all words occurring in the external database, a library lexicon is made, whereby the LSD results are compared via the library lexicon instead of the external database itself.
- Figure 7 shows in a block diagram the method for indexing the external database, or comparison database.
- The words of the comparison database occurring in the database records are first lemmatized 5. This means that the words present in the database records are reduced to their lemma. This takes place for instance by comparing each word of the comparison database to a word list in which relations between words and their lemmas are recorded.
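Lemmatization by table lookup can be sketched as follows. The word list here is a hypothetical two-entry table built from the word pairs mentioned elsewhere in this description; lower-casing and keeping unknown words unchanged are assumptions, as is the function name `lemmatize`.

```python
# Hypothetical word list relating inflected words to their lemmas.
LEMMA_TABLE = {
    "bedrijven": "bedrijf",
    "gebroeders": "broeder",
}


def lemmatize(words, lemma_table=LEMMA_TABLE):
    """Replace each word by its lemma; words not in the table are kept as-is."""
    lemmas = []
    for word in words:
        key = word.lower()                  # assumption: matching ignores case
        lemmas.append(lemma_table.get(key, key))
    return lemmas
```

A word such as "Amsterdam", for which the table holds no entry, simply passes through in lower case.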
- In table 1 an example is given of a comparison database of two records. Record I:
- The records of the comparison database, whether or not lemmatized, are then indexed in the following manner:
- A. File inversion 6, wherein an inverted file is created on the basis of the records of the comparison database, whether lemmatized or not.
- The inverted file is referred to as lemma-based inverted file (LBIF) 7 and contains an alphabetically ordered list of lemmas, wherein each lemma contains a reference to all database records in which the words from which it is derived occur.
- Table 3 shows the result of inverting the two lemmatized records of the comparison database of table 2.
- The LBIF contains all lemmas of the comparison database alphabetically ordered, wherein each lemma contains a reference to each record of the database in which it occurs.
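File inversion can be sketched as follows. The two sample records are hypothetical and are not the records of tables 1 to 3; the function name `build_lbif` and the dictionary input format are likewise assumptions made for the illustration.

```python
def build_lbif(records):
    """Invert {record_id: [lemma, ...]} into an alphabetical lemma list.

    Returns a list of (lemma, [record_ids]) pairs, ordered by lemma, where
    each lemma refers to every record in which it occurs.
    """
    postings = {}
    for record_id, lemmas in records.items():
        for lemma in lemmas:
            postings.setdefault(lemma, set()).add(record_id)
    return [(lemma, sorted(postings[lemma])) for lemma in sorted(postings)]


# Hypothetical lemmatized records, for illustration only.
SAMPLE_RECORDS = {
    1: ["aerdenhout", "bedrijf"],
    2: ["amsterdam", "bedrijf"],
}
```

For these sample records the lemma "bedrijf" ends up with references to both records, while "amsterdam" is the second entry of the list and refers to record 2 only.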
- A trigram index 9 is made on the basis of this inverted file.
- The trigram index contains an alphabetically ordered list of trigrams, wherein each trigram consists of three letters originating from the lemmas from the lemma-based inverted file and wherein references are included to all locations of the trigram in the lemma-based inverted file.
- Table 4 shows the trigram index on the basis of the LBIF of table 3, with entries such as "*ae : 1" and "*am : 2".
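A trigram index of this form can be sketched as follows. The sketch assumes that each lemma is prefixed with a single "*" before it is cut into trigrams, which produces entries of the shape "*ae" and "*am"; this padding rule, the hypothetical lemma list and the function names are assumptions, not taken from the source tables.

```python
def trigrams(word):
    """Split a word into overlapping trigrams, with one leading "*" as padding."""
    padded = "*" + word.lower()
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def build_trigram_index(lbif_lemmas):
    """Map each trigram to the 1-based LBIF positions of lemmas containing it."""
    index = {}
    for position, lemma in enumerate(lbif_lemmas, start=1):
        for trigram in trigrams(lemma):
            references = index.setdefault(trigram, [])
            if position not in references:
                references.append(position)
    # Alphabetical ordering, as in the description of the index.
    return dict(sorted(index.items()))


# Hypothetical alphabetical lemma list standing in for the LBIF.
SAMPLE_LEMMAS = ["aerdenhout", "amsterdam", "bedrijf"]
```

With this lemma list, `build_trigram_index(SAMPLE_LEMMAS)` maps "*ae" to position 1 and "*am" to position 2, and a trigram shared by several lemmas, such as "erd", refers to all of them.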
- Vector space modelling 10, wherein a vector space index (vector space model, VSM index 11) is made on the basis of the records, whether or not lemmatized, of the comparison database.
- The VSM index 11 contains 16 coordinates per record to describe the characteristics thereof.
- The degree of similarity between the VSM index 11 (originating from the comparison database) and the title description of the library card can then be determined.
- A weighting can also be performed which for instance takes into account the number of times a lemma occurs in a record. Weighting factors can also be determined on other grounds and vary for instance from 0 to 1 depending on the degree to which lemmas are identifying for the records. In the example it is assumed for the sake of simplicity that no weighting is performed on the coordinates, which does not however detract from the method according to the present invention.
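Building such a vector space index can be sketched as follows. The sketch assumes one coordinate per lemma of the comparison database, set to 1 when the lemma occurs in the record and 0 otherwise, with an optional per-lemma weight; the sample records and the name `build_vsm_index` are hypothetical.

```python
def build_vsm_index(records, weights=None):
    """Return (vocabulary, {record_id: vector}) for {record_id: [lemma, ...]}.

    Each vector has one coordinate per lemma in the alphabetical vocabulary.
    Without weights, a coordinate is 1.0 if the lemma occurs in the record
    and 0.0 otherwise; `weights` may assign a lemma a different value.
    """
    vocabulary = sorted({lemma for lemmas in records.values() for lemma in lemmas})
    weights = weights or {}
    index = {}
    for record_id, lemmas in records.items():
        present = set(lemmas)
        index[record_id] = [
            weights.get(lemma, 1.0) if lemma in present else 0.0
            for lemma in vocabulary
        ]
    return vocabulary, index


# Hypothetical lemmatized records, for illustration only.
VSM_SAMPLE = {1: ["aerdenhout", "bedrijf"], 2: ["amsterdam", "bedrijf"]}
```

The number of coordinates equals the number of distinct lemmas, so a database with 16 distinct lemmas gives 16 coordinates per record.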
- FIG. 8 is a block diagram in which the method for selecting the keywords from all title descriptions is shown. For efficiency reasons only a limited number of words are selected for comparing with the comparison database. All input records, or title descriptions, are inverted 12, i.e. all words from all title descriptions are placed in an alphabetically ordered list 13 with reference to all title descriptions in which these words occur. The words of the title descriptions are preferably lemmatized herein, although this is not essential. The frequency and distribution in the different title descriptions of each word in the alphabetically ordered list is then determined 14.
- Determining the frequency and distribution can take place in many different ways (see for instance § 14.5 of said book by Frakes and Baeza-Yates).
- The result of the method is an alphabetically ordered list 15 of keywords occurring in all title descriptions, wherein a numeric value is added to each word which is a measure of its significance.
- The alphabetically ordered keyword list 15 only contains words of this one title description.
- Performing statistical operations is more useful.
- Frequently occurring words, which are less suitable for functioning as keyword, will automatically acquire a lower degree of significance.
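One way to obtain such a significance value is sketched below. The measure used, a logarithm of the inverse document frequency plus one, is an assumption picked as a common choice; the source leaves the statistical method open. The sample title descriptions and the function names are hypothetical.

```python
import math


def keyword_significance(title_descriptions):
    """Return {word: significance}, alphabetically ordered.

    Assumption: significance = log(N / df) + 1, where N is the number of
    title descriptions and df the number of descriptions containing the word,
    so words occurring in many descriptions score lower.
    """
    total = len(title_descriptions)
    doc_freq = {}
    for words in title_descriptions.values():
        for word in set(words):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {w: math.log(total / doc_freq[w]) + 1.0 for w in sorted(doc_freq)}


def select_keywords(words, significance, n):
    """Pick the n most significant (most infrequent) words of one description."""
    unique = sorted(set(words))
    return sorted(unique, key=lambda w: -significance.get(w, 0.0))[:n]


# Hypothetical title descriptions, already lemmatized.
SAMPLE_TITLES = {
    1: ["anisterdarn", "bedrijf", "broeder"],
    2: ["bedrijf", "broeder"],
    3: ["bedrijf", "rotterdam"],
}
```

In this sample, "bedrijf" occurs in every description and receives the lowest value, while the rare, misrecognized "anisterdarn" is selected first.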
- Table 6 shows such a list for the present example, wherein for each keyword a degree of significance is also given.
- The term "Anisterdarn" is the most infrequent and is therefore selected as the first, most significant keyword.
- The words of the input records, or title descriptions, are lemmatized, which results in the terms "bedrijven" and "Gebroeders" being lemmatized to "bedrijf" and "broeder" respectively.
- Figure 9 is a block diagram showing the method for comparing the title description with the external database and linking the title description to a record from the comparison database.
- The title descriptions in the input records are compared one by one with the information from the comparison database.
- From the title description the n most infrequent keywords are selected 16 on the basis of the keyword list with statistical information, wherein the number of keywords n can be varied and is determined in practice.
- The keywords are then input into the trigram comparison module 17.
- The keyword "Anisterdarn" is chosen as first input into trigram comparison module 17.
- The keyword is divided into trigrams consisting of three letters (wherein the character "*" represents an arbitrary character from the alphabet).
- The trigrams of the keywords are subsequently compared to those of the above described trigram index 9.
- Figure 10 indicates how the comparison is effected for the present example.
- Four of the ten trigrams of "Anisterdarn" correspond with trigrams from trigram index 9 which have a reference to the second word of the LBIF 7.
- The second word in LBIF 7 is in this case "Amsterdam".
- This number is greater than the number of corresponding trigrams in the other words of LBIF 7, so that in this case it is decided that the word "Anisterdarn" in the title description to be recognized corresponds with the word "Amsterdam".
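The trigram comparison can be sketched as follows. The sketch assumes a single leading "*" as padding, which gives ten trigrams for the eleven-letter keyword and four matches with "amsterdam", in line with the numbers above; the other lemmas in the list, and the function names, are hypothetical.

```python
def trigrams(word):
    """Split a word into overlapping trigrams, with one leading "*" as padding."""
    padded = "*" + word.lower()
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def best_matching_lemma(keyword, lbif_lemmas):
    """Return the lemma sharing the most trigrams with the keyword, plus all scores."""
    key_trigrams = set(trigrams(keyword))
    scores = {
        lemma: len(key_trigrams & set(trigrams(lemma)))
        for lemma in lbif_lemmas
    }
    # On a tie, the lemma that comes first in the list wins.
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical LBIF lemma list with "amsterdam" as its second word.
LBIF_LEMMAS = ["aerdenhout", "amsterdam", "bedrijf"]
```

For the keyword "Anisterdarn" the shared trigrams with "amsterdam" are "ste", "ter", "erd" and "rda", so that lemma is selected. A simple count is used here; weighting individual trigrams would be a refinement of the `scores` computation.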
- The decision can also be taken on other grounds.
- A higher significance of particular trigrams relative to other trigrams can for instance be taken into consideration.
- The trigram "tje" for instance occurs so frequently in Dutch that a lower significance can be assigned to this trigram than to another, less usual trigram.
- The trigram comparison 17 is repeated for all keywords.
- The result is a list of keywords (or lemmas) from the LBIF corresponding to the keywords, wherein these keywords from the LBIF in turn contain a reference to the records in the comparison database in which they are situated.
- The keywords (lemmas) from LBIF 7 corresponding with the keywords from the title description are "matched" in the VSM module 18 with the previously constructed VSM index 11 of the comparison database.
- The title description is provided with coordinates in a manner analogous to that specified above in respect of the VSM index 11.
- The degree of similarity between a record j of VSM index 11 and the title description can be determined in various ways, for instance as follows:
- The calculation of the similarity is repeated for all records of VSM index 11.
- The record 19 from VSM index 11 with the greatest similarity is then selected as being the record which most probably corresponds with the title description of the library card.
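Selecting the most similar record can be sketched as follows. The similarity formula of the source is not reproduced in this text, so the sketch substitutes cosine similarity, a common measure in vector space models; this is an assumption and not necessarily the measure intended. The sample vectors and function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long coordinate vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar_record(title_vector, vsm_index):
    """Return the id of the record whose vector is most similar to the title's."""
    return max(
        vsm_index,
        key=lambda record_id: cosine_similarity(title_vector, vsm_index[record_id]),
    )


# Hypothetical VSM index over the vocabulary (aerdenhout, amsterdam, bedrijf).
SAMPLE_VSM_INDEX = {1: [1.0, 0.0, 1.0], 2: [0.0, 1.0, 1.0]}
```

A title description with the coordinates `[0.0, 1.0, 1.0]` is linked to record 2 in this sample, since its vector coincides with that of record 2.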
- The library card is therefore linked to this record of the comparison database.
- Figure 11 shows in a block diagram the method applied for indirect comparison with the comparison database.
- All input records, or title descriptions are inverted, i.e. all words of the title description are placed in an alphabetical list with reference to all input records, or title descriptions, in which these words occur, whereby a file index results.
- The significance of each word and of each term in the file index is then determined.
- The input records are subsequently fed one by one to a keyword selection module 16.
- The keyword selection module selects, in the manner described above, the n most relevant keywords of the input record.
- Via a trigram comparison 17 on the basis of the trigram index 9, these keywords are compared with said library lexicon 20.
- The keywords resulting from this comparison, which are spelled correctly owing to the trigram comparison, are subsequently used to retrieve 21 the corresponding database records 19 of the comparison database 5 via the interface 22 of the external comparison database 5.
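The indirect route can be sketched end to end as follows. Everything concrete in the sketch is an assumption: the lexicon contents, the `FAKE_DATABASE` stand-in and its `fake_query` function (representing whatever search the interface of the external database offers), the leading "*" padding, and all function names.

```python
def trigrams(word):
    """Set of overlapping trigrams of a word, with one leading "*" as padding."""
    padded = "*" + word.lower()
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def correct_keyword(keyword, lexicon):
    """Replace a keyword by the lexicon entry sharing the most trigrams with it."""
    key = trigrams(keyword)
    return max(lexicon, key=lambda entry: len(key & trigrams(entry)))


def link_indirectly(keywords, lexicon, query_interface):
    """Correct the keywords via the lexicon, then query the external database."""
    corrected = [correct_keyword(k, lexicon) for k in keywords]
    return corrected, query_interface(corrected)


# Hypothetical stand-in for the external database behind its interface.
FAKE_DATABASE = {10: {"amsterdam", "bedrijf"}, 11: {"rotterdam", "bedrijf"}}


def fake_query(words):
    """Return ids of records containing all given words (illustrative only)."""
    return sorted(
        record_id
        for record_id, content in FAKE_DATABASE.items()
        if all(word in content for word in words)
    )


LIBRARY_LEXICON = ["amsterdam", "bedrijf", "rotterdam"]
```

Here `link_indirectly(["Anisterdarn", "bedrijf"], LIBRARY_LEXICON, fake_query)` first corrects the misrecognized keyword against the lexicon and only then contacts the database, so the external system needs to support nothing beyond an ordinary word search.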
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP98939011A EP1004090A1 (en) | 1997-08-11 | 1998-08-07 | Retrospective conversion |
AU87525/98A AU8752598A (en) | 1997-08-11 | 1998-08-07 | Retrospective conversion |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
NL1006757 | 1997-08-11 | ||
NL1006757A NL1006757C2 (en) | 1997-08-11 | 1997-08-11 | Retrospective conversion. |
Publications (1)
Publication Number | Publication Date |
---|---|
WO1999008226A1 (en) | 1999-02-18 |
Family
ID=19765480
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/NL1998/000452 WO1999008226A1 (en) | 1997-08-11 | 1998-08-07 | Retrospective conversion |
Country Status (4)
Country | Link |
---|---|
EP (1) | EP1004090A1 (en) |
AU (1) | AU8752598A (en) |
NL (1) | NL1006757C2 (en) |
WO (1) | WO1999008226A1 (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0465818A2 (en) * | 1990-06-12 | 1992-01-15 | Horst Froesel | Mass document storage and retrieval system |
-
1997
- 1997-08-11 NL NL1006757A patent/NL1006757C2/en not_active IP Right Cessation
-
1998
- 1998-08-07 EP EP98939011A patent/EP1004090A1/en not_active Ceased
- 1998-08-07 AU AU87525/98A patent/AU8752598A/en not_active Abandoned
- 1998-08-07 WO PCT/NL1998/000452 patent/WO1999008226A1/en not_active Application Discontinuation
Non-Patent Citations (3)
Title |
---|
CAVNAR W B ET AL: "N-GRAM-BASED MATCHING FOR MULTIFIELD DATABASE ACCESS IN POSTAL APPLICATIONS", PROCEEDINGS. ANNUAL SYMPOSIUM ON DOCUMENT ANALYSIS & INFORMATION RETRIEVAL, 26 April 1993 (1993-04-26), pages 287 - 297, XP000600549 * |
ROWLEY J E: "Current awareness of competitive intelligence: a review of the options", ASLIB PROCEEDINGS, NOV.-DEC. 1992, UK, vol. 44, no. 11-12, ISSN 0001-253X, pages 367 - 372, XP002064188 * |
THOMAS P A: "From library card index to international online database: the development of ICR", ONLINE INFORMATION 87. 11TH INTERNATIONAL ONLINE INFORMATION MEETING, LONDON, UK, 8-10 DEC. 1987, ISBN 0-904933-62-8, 1987, OXFORD, UK, LEARNED INF, UK, pages 77 - 86, XP002064187 * |
Also Published As
Publication number | Publication date |
---|---|
AU8752598A (en) | 1999-03-01 |
NL1006757C2 (en) | 1999-02-12 |
EP1004090A1 (en) | 2000-05-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP0544430B1 (en) | Method and apparatus for determining the frequency of words in a document without document image decoding | |
US5664027A (en) | Methods and apparatus for inferring orientation of lines of text | |
EP0544434B1 (en) | Method and apparatus for processing a document image | |
US5491760A (en) | Method and apparatus for summarizing a document without document image decoding | |
Manmatha et al. | Word spotting: A new approach to indexing handwriting | |
EP0544433B1 (en) | Method and apparatus for document image processing | |
EP0544431B1 (en) | Methods and apparatus for selecting semantically significant images in a document image without decoding image content | |
EP0854433B1 (en) | Caption and photo extraction from scanned document images | |
Pal et al. | Machine-printed and hand-written text lines identification | |
Shafait et al. | Document cleanup using page frame detection | |
Ma et al. | Adaptive Hindi OCR using generalized Hausdorff image comparison | |
Saoji et al. | Text recognition and detection from images using pytesseract | |
Padma et al. | IDENTIFICATION OF TELUGU, DEVANAGARI AND ENGLISH SCRIPTS USING DISCRIMINATING | |
Chaudhuri et al. | Extraction of type style-based meta-information from imaged documents | |
Niyogi et al. | An integrated approach to document decomposition and structural analysis | |
Srinivas et al. | An overview of OCR research in Indian scripts | |
Marinai et al. | Exploring digital libraries with document image retrieval | |
EP1004090A1 (en) | Retrospective conversion | |
Ramani et al. | Optical character recognition for scripts and documents | |
Kliatskine et al. | A structured method for the recognition of complex historical tables | |
Eqbal | EXTRACTION AND DETECTION OF TEXT FROM IMAGES | |
Boudraa | DLSpot: Original and Coherent Keyword Spotting System Using DTW Classifier and LBP Texture Descriptor | |
Yeotikar et al. | Script identification of text words from multilingual Indian document | |
Said | Automatic processing of documents and bank cheques | |
JPH0589279A (en) | Character recognizing device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GE GH GM HR HU ID IL IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG US UZ VN YU ZW |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): GH GM KE LS MW SD SZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | ||
WWE | Wipo information: entry into national phase |
Ref document number: 1998939011 Country of ref document: EP |
|
NENP | Non-entry into the national phase |
Ref country code: KR |
|
WWP | Wipo information: published in national office |
Ref document number: 1998939011 Country of ref document: EP |
|
REG | Reference to national code |
Ref country code: DE Ref legal event code: 8642 |
|
NENP | Non-entry into the national phase |
Ref country code: CA |
|
WWR | Wipo information: refused in national office |
Ref document number: 1998939011 Country of ref document: EP |
|
WWW | Wipo information: withdrawn in national office |
Ref document number: 1998939011 Country of ref document: EP |