EP1004090A1 - Method and device for retrospective conversion

Method and device for retrospective conversion

Info

Publication number
EP1004090A1
Authority
EP
European Patent Office
Prior art keywords
database
document
word
keywords
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
EP98939011A
Other languages
English (en)
French (fr)
Inventor
Johannes Van Gent
Rudie Ekkelenkamp
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nederlandse Organisatie voor Toegepast Natuurwetenschappelijk Onderzoek TNO
Original Assignee
Nederlandse Organisatie voor Toegepast Natuurwetenschappelijk Onderzoek TNO
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
1997-08-11
Filing date
1998-08-07
Publication date
2000-05-31
Application filed by Nederlandse Organisatie voor Toegepast Natuurwetenschappelijk Onderzoek TNO

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/26 Techniques for post-processing, e.g. correcting the recognition result
    • G06V30/262 Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Definitions

  • The present invention relates to a method and device for converting information arranged on paper into an electronically structured data file or database. This conversion is also referred to as retrospective conversion.
  • Paper data files have been built up in the past for different purposes, for instance library catalogues, birth registers, criminal records and the like, in which large quantities of information are recorded on paper.
  • The information is often arranged in handwritten or typed form, or in the form of stamps and the like.
  • The object of the present invention is therefore to provide a method and device with which written or typed information can be stored, correctly structured, in an electronic database.
  • The invention therefore relates to a method for linking written or typed information in a document to database information from a database.
  • The present invention also comprises a device for linking written or typed information in a document to database information from a database, comprising:
  • - keyword selection means for selecting keywords from the converted document;
  • - keyword list means for drawing up a key part-word list consisting in each case of a number of successive characters of the keywords;
  • - database word list means for drawing up a database word list of the database words occurring in the database;
  • - database part-word means for drawing up a database part-word list consisting in each case of the above-stated number of successive characters of words from the database word list, wherein each database part-word contains a reference to the database word of which it forms part;
  • - comparing means for comparing the key part-words from the document with the database part-word list;
  • - database word selection means for selecting, on the basis of the comparison, the database words corresponding with the keywords.
  • Figure 1 is a block diagram of a device for converting written or typed information into a structured electronic data file;
  • Figure 2 is a block diagram of the method for converting written or typed information into a structured text, without the text having been corrected for errors;
  • Figure 3 is an example of a library card on which retrospective conversion must be performed;
  • Figure 4 is an example of the same library card after segmentation of the foreground and background;
  • Figures 5a and 5b show another example of segmentation of the foreground and background;
  • Figure 6 is an example of the same library card after performing image processing steps a to j;
  • Figure 7 is a block diagram showing the method for indexing an external database;
  • Figure 8 is a block diagram which shows schematically the method for selecting keywords;
  • Figure 9 is a block diagram showing the method of direct linking of documents to records of a comparison database;
  • Figure 10 is an example of the comparison of trigrams, i.e. comparison of combinations of three successive characters;
  • Figure 11 is a block diagram showing the method of indirect linking of documents to records of a comparison database.
  • The preferred embodiment of the present invention relates to the conversion of data on library cards to an electronic database.
  • Manual conversion to a database is too costly and too time-consuming.
  • Automatic conversion, on the other hand, is very difficult in view of the wide variety in structure: information such as title, author, location and the like does not have a fixed, unambiguous position on the library card.
  • The variety in handwriting, the variety in typefaces (fonts) used and possible soiling of the card, for instance by coffee stains, can also hamper automatic conversion.
  • Figure 1 shows a preferred embodiment of a device with which the conversion can be effected.
  • The device comprises inter alia a scanner 1, with which the library cards can be read in, a connection 2 to an external database 5, a storage medium 3 for storing data and a computer 4 for controlling the data conversion.
  • Figure 2 shows a block diagram representing the method for converting written or typed information into a form readable by the computer. Linking to an external database has not yet taken place at this stage.
  • A library card is characterized by the following three characteristics: a title description, which contains information about the title, author and edition of the relevant book and relates only to the document itself, not to the specific location where the document can be found; a signature, which contains information about the specific location in the library where a copy of the document can be found; and a logical structure, which indicates that the title description of the library card is divided into segments designating the different types of bibliographical information.
  • The documents or library cards are converted into digital images using a scanner. These digital images can be binary, i.e. black-and-white values only, or can contain colour and/or grey tones. In order to carry out effective image processing at a later stage, digital images with grey tones and/or colour tones are preferably generated. This is however not essential for the method of the present invention.
  • b. Foreground-background segmentation
  • The quality of the digital image is subsequently enhanced by means of image enhancement techniques such as foreground-background segmentation, wherein the text (foreground) on a card is separated from other information (background), for instance coffee stains, colours and patterns in the background.
  • Figure 4 shows the result of the image enhancement by foreground-background segmentation. Background information in the form of the edges of the stamp of the signature has been removed from the digital image of the library card. Figures 5a and 5b show another example, wherein the background pattern is removed from the original image by segmentation.
  • The digital image is then subjected to automatic extraction and marking of the logical components of the library card.
  • Logical components are parts of the image that a normal user regards as a meaningful unit or entity.
  • Logical components in a scientific book are for instance notes, chapter titles, paragraph titles, footnotes and the like.
  • The purpose of this is to code each library card, hereafter also referred to as a document, in accordance with a given document type definition (DTD) as used in SGML coding, which definition describes the logical structure of the document, in this case the library card itself.
  • A macro object is built up of various micro objects or macro objects: for instance, the macro object "word" is built up of various characters, the macro object "line" is built up of various words, etc.
  • A series of image processing steps is performed to extract the connected components and their characteristics, followed by marking steps in which components are grouped into macro objects and marked as SGML elements, as described hereinbelow.
  • A micro object can be seen as a single spot of connected pixels surrounded by "open" space (white space).
  • An example of a micro object is a single letter "e".
  • The letter "i" is built up of two objects, i.e. the dot and the rest of the letter. Each object is described by a frame formed by the left and top coordinates of the object and the width and height of the object.
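The micro object extraction described above can be illustrated with a minimal sketch. It assumes a binary image given as a list of rows (1 = black, 0 = white) and 8-connectivity; the function name `extract_micro_objects` and the tiny test image are hypothetical and are not taken from the patent.

```python
from collections import deque


def extract_micro_objects(image):
    """Return one frame (left, top, width, height) per spot of connected black pixels.

    `image` is a list of rows; 1 = black (foreground), 0 = white (background).
    8-connectivity is an assumption; the description does not specify it.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    frames = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill over one connected spot.
            queue = deque([(x, y)])
            seen[y][x] = True
            min_x = max_x = x
            min_y = max_y = y
            while queue:
                cx, cy = queue.popleft()
                min_x, max_x = min(min_x, cx), max(max_x, cx)
                min_y, max_y = min(min_y, cy), max(max_y, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            frames.append((min_x, min_y, max_x - min_x + 1, max_y - min_y + 1))
    return frames


# A crude letter "i": a dot, a white gap, then the stem.
LETTER_I = [
    [0, 1, 0],
    [0, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
]
print(extract_micro_objects(LETTER_I))  # [(1, 0, 1, 1), (1, 2, 1, 3)]
```

As in the description, the dot and the stem of the "i" come out as two separate micro objects, each with its own frame.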
  • The result of micro object extraction is a list of micro objects.
  • d. Histogram analysis
  • Domain-independent knowledge is for instance the fact that it is known a priori that words and sentences in a particular language are written from left to right. It is therefore important to extract the relevant information from the document itself. Histogram analysis can be used for this purpose.
  • All objects determined during the micro object extraction are used in making histograms of the coverage, defined as the percentage of black pixels within the frame of an object, the height, defined as the height of the frame of an object, and the surface area, defined as the surface area of the frame of an object. Histograms are also made of the entire document (thus, all objects together).
  • e. Marking of micro objects
  • This process is based on algorithms, each of which makes use of the statistical data originating from the histogram analysis.
  • A distinction is made here between text objects and other objects, for instance photo, table or graphic objects.
  • The detection of these objects is performed with a decision tree in which various parameters are used. For each object type it is described which parameters are important and how they can be used to separate the different objects.
  • The parameters used are for instance: left-hand position of the object, top position of the object, width of the object, height of the object, width/height ratio of the object, area of the object, number of object pixels, and coverage (number of object pixels within the frame divided by the total number of pixels within the frame).
  • Photographs can for instance be detected because the surface area of the object is large, the coverage is high and the width/height ratio is about 0.05 to 5.
  • Text can for instance be detected by its small size and width.
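The marking rules above can be sketched as a very small decision tree. Only the width/height range of 0.05 to 5 comes from the description; the class `MicroObject`, the function `mark` and the thresholds `large_area`, `high_coverage` and `text_height_factor` are illustrative assumptions that would have to be tuned from the histograms in practice.

```python
from dataclasses import dataclass


@dataclass
class MicroObject:
    left: int
    top: int
    width: int
    height: int
    black_pixels: int  # number of object pixels inside the frame

    @property
    def area(self):
        return self.width * self.height

    @property
    def coverage(self):
        # Object pixels within the frame divided by all pixels within the frame.
        return self.black_pixels / self.area

    @property
    def aspect(self):
        return self.width / self.height


def mark(obj, average_height, large_area=10_000, high_coverage=0.5,
         text_height_factor=2.0):
    """Label one micro object as "photo", "text" or "other".

    All thresholds except the 0.05..5 aspect range are assumed values.
    """
    if (obj.area >= large_area and obj.coverage >= high_coverage
            and 0.05 <= obj.aspect <= 5):
        return "photo"
    # Height is compared with the average object height; width is ignored
    # here because scanned letters may be joined together.
    if obj.height <= text_height_factor * average_height:
        return "text"
    return "other"


letter = MicroObject(left=10, top=10, width=8, height=12, black_pixels=40)
photo = MicroObject(left=0, top=0, width=200, height=150, black_pixels=24_000)
border = MicroObject(left=0, top=0, width=400, height=300, black_pixels=6_000)
for obj in (letter, photo, border):
    print(mark(obj, average_height=14))
```

A real decision tree would test more of the listed parameters (position, number of object pixels) and would distinguish tables and graphics as well.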
  • The heights are compared with the average height of all objects on the library card. Width is not a very reliable parameter because letters may still be joined together after having been scanned.
  • f. Straightening documents
  • Micro objects such as letters are grouped into macro objects, such as paragraphs.
  • Knowledge about the logical structure of the library card is used together with the results of the histogram analyses.
  • Documents are usually written in horizontal and vertical direction. By projecting all frames onto a horizontal and a vertical axis, areas running horizontally or vertically through the document can be found in which no micro objects are situated (so-called "white rivers"). If these areas are wide or high enough, a title or column can for instance be detected.
  • h. Marking of macro objects
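The search for "white rivers" by projecting all frames onto an axis can be sketched as follows. The function name `white_rivers`, the `min_gap` parameter and the example frames are hypothetical; frames use the (left, top, width, height) convention of the micro object extraction.

```python
def white_rivers(frames, extent, axis="y", min_gap=1):
    """Project frames onto one axis and return the empty intervals.

    frames  : iterable of (left, top, width, height)
    extent  : page size in pixels along the chosen axis
    axis    : "y" finds horizontal rivers, "x" finds vertical rivers
    min_gap : minimum river length in pixels (assumed tuning parameter)
    Returns half-open intervals (start, end) that contain no micro object.
    """
    occupied = [False] * extent
    for left, top, width, height in frames:
        start, size = (top, height) if axis == "y" else (left, width)
        for pos in range(max(start, 0), min(start + size, extent)):
            occupied[pos] = True

    rivers = []
    run_start = None
    for pos in range(extent + 1):
        free = pos < extent and not occupied[pos]
        if free and run_start is None:
            run_start = pos
        elif not free and run_start is not None:
            if pos - run_start >= min_gap:
                rivers.append((run_start, pos))
            run_start = None
    return rivers


# Two closely spaced text lines, a wide gap, then a third line.
FRAMES = [(0, 0, 50, 10), (0, 12, 50, 10), (0, 40, 50, 10)]
print(white_rivers(FRAMES, extent=60, axis="y", min_gap=5))  # [(22, 40), (50, 60)]
```

The two-pixel gap between the first two lines is ignored because it is shorter than `min_gap`, whereas the wide gap could separate, for instance, a title from the rest of the card.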
  • Optical character recognition is the process of reading a collection of pixels and converting it into letters.
  • The optical character recognition technique also provides knowledge concerning the typeface (font), the character size and the style, which information can be used again at a later stage.
  • The method for comparing the results of the steps of the LSD analysis with the external database depends on whether standard database operations can be performed on the external database. If they cannot, no direct comparison can be made. The comparison can however take place indirectly: on the basis of a list of all words occurring in the external database a library lexicon is made, and the LSD results are compared with the library lexicon instead of with the external database itself.
  • Figure 7 shows in a block diagram the method for indexing the external database, or comparison database.
  • The words occurring in the records of the comparison database are first lemmatized 5, i.e. reduced to their lemma. This takes place for instance by comparing each word of the comparison database with a word list in which the relations between words and their lemmas are recorded.
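Lemmatization by word-list lookup can be sketched in a few lines. The two word/lemma pairs are the ones used in the description's own example; the table name `LEMMA_TABLE`, the lower-casing and the fallback of returning the word itself are assumptions.

```python
# Word list relating words to their lemmas (only two illustrative entries).
LEMMA_TABLE = {
    "bedrijven": "bedrijf",
    "gebroeders": "broeder",
}


def lemmatize(word, table=LEMMA_TABLE):
    """Reduce a word to its lemma; unknown words are returned lower-cased."""
    key = word.lower()
    return table.get(key, key)


print([lemmatize(w) for w in ["Gebroeders", "bedrijven", "Amsterdam"]])
```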
  • Table 1 gives an example of a comparison database of two records.
  • The records of the comparison database, whether or not lemmatized, are then indexed in the following manner:
  • A. File inversion 6, wherein an inverted file is created on the basis of the records of the comparison database, whether lemmatized or not.
  • The inverted file is referred to as the lemma-based inverted file (LBIF) 7 and contains an alphabetically ordered list of lemmas, wherein each lemma contains a reference to all database records in which the words from which it is derived occur.
  • Table 3 shows the result of inverting the two lemmatized records of the comparison database of table 2.
  • The LBIF thus contains all lemmas of the comparison database in alphabetical order, wherein each lemma contains a reference to each record of the database in which it occurs.
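File inversion can be illustrated with a short sketch. The two records below are hypothetical stand-ins, since the contents of tables 1 to 3 are not reproduced here; `build_lbif` is an assumed name.

```python
def build_lbif(records):
    """Invert lemmatized records into an alphabetically ordered lemma list.

    records : dict mapping a record id to the list of lemmas in that record
    Returns a list of (lemma, [record ids]) pairs, ordered by lemma.
    """
    postings = {}
    for record_id, lemmas in records.items():
        for lemma in lemmas:
            postings.setdefault(lemma, set()).add(record_id)
    return [(lemma, sorted(ids)) for lemma, ids in sorted(postings.items())]


# Hypothetical lemmatized records, not the patent's table 2.
RECORDS = {
    "I": ["amsterdam", "bedrijf", "broeder"],
    "II": ["amsterdam", "haven"],
}
for lemma, record_ids in build_lbif(RECORDS):
    print(lemma, record_ids)
```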
  • A trigram index 9 is made on the basis of this inverted file.
  • The trigram index contains an alphabetically ordered list of trigrams, wherein each trigram consists of three letters originating from the lemmas of the lemma-based inverted file and wherein references are included to all locations of the trigram in the lemma-based inverted file.
  • Table 4 shows the trigram index based on the LBIF of table 3, with entries such as *ae : 1 and *am : 2.
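A trigram index of this kind can be sketched as below. The padding convention (a single leading "*" and no trailing padding) is inferred from the entries "*ae" and "*am" and from the count of ten trigrams for the eleven-letter example keyword, so it should be read as an assumption; the lemma "aerts" is a hypothetical first entry.

```python
def trigrams(word, pad="*"):
    """Split a word into overlapping trigrams, with one leading pad character."""
    padded = pad + word.lower()
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def build_trigram_index(lemmas):
    """Map each trigram to the 1-based positions of the lemmas containing it.

    lemmas : the alphabetically ordered lemma list of the inverted file
    """
    index = {}
    for position, lemma in enumerate(lemmas, start=1):
        for tri in trigrams(lemma):
            index.setdefault(tri, set()).add(position)
    return {tri: sorted(positions) for tri, positions in sorted(index.items())}


LEMMAS = ["aerts", "amsterdam", "bedrijf"]  # "aerts" is a hypothetical lemma
INDEX = build_trigram_index(LEMMAS)
print(INDEX["*ae"], INDEX["*am"], INDEX["ste"])  # [1] [2] [2]
```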
  • Vector space modelling 10, wherein a vector space index (vector space model index, VSM index 11) is made on the basis of the records, whether or not lemmatized, of the comparison database.
  • The VSM index 11 contains 16 coordinates per record to describe the characteristics thereof.
  • The degree of similarity between the VSM index 11 (originating from the comparison database) and the title description of the library card can then be determined.
  • A weighting can also be performed, which for instance takes into account the number of times a lemma occurs in a record. Weighting factors can also be determined on other grounds and vary for instance from 0 to 1, depending on the degree to which lemmas identify the records. In the example it is assumed for the sake of simplicity that no weighting is performed on the coordinates; this does not detract from the method according to the present invention.
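The construction of a vector space index can be sketched as follows, assuming one coordinate per lemma of the comparison database (1 if the lemma occurs in the record, 0 otherwise) and an optional per-lemma weight between 0 and 1. The function name, the records and the weight value are hypothetical.

```python
def build_vsm_index(records, weights=None):
    """Give every record one coordinate per lemma of the whole database.

    records : dict mapping a record id to the list of lemmas in that record
    weights : optional dict lemma -> weighting factor (default 1.0)
    Returns (vocabulary, index) where index maps record id -> coordinate list.
    """
    vocabulary = sorted({lemma for lemmas in records.values() for lemma in lemmas})
    weights = weights or {}
    index = {}
    for record_id, lemmas in records.items():
        present = set(lemmas)
        index[record_id] = [
            weights.get(lemma, 1.0) if lemma in present else 0.0
            for lemma in vocabulary
        ]
    return vocabulary, index


# Hypothetical lemmatized records.
RECORDS = {
    "I": ["amsterdam", "bedrijf", "broeder"],
    "II": ["amsterdam", "haven"],
}
vocabulary, vsm_index = build_vsm_index(RECORDS)
print(vocabulary)
print(vsm_index)

# A lemma occurring in every record identifies records poorly, so it may
# be given a lower (here arbitrarily chosen) weight.
_, weighted_index = build_vsm_index(RECORDS, weights={"amsterdam": 0.2})
```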
  • Figure 8 is a block diagram showing the method for selecting the keywords from all title descriptions. For reasons of efficiency only a limited number of words is selected for comparison with the comparison database. All input records, or title descriptions, are inverted 12, i.e. all words from all title descriptions are placed in an alphabetically ordered list 13 with a reference to all title descriptions in which these words occur. The words of the title descriptions are preferably lemmatized here, although this is not essential. The frequency and distribution of each word of the alphabetically ordered list over the different title descriptions is then determined 14.
  • The frequency and distribution can be determined in many different ways (see for instance § 14.5 of said book by Frakes and Baeza-Yates).
  • The result of the method is an alphabetically ordered list 15 of the keywords occurring in all title descriptions, wherein a numeric value which is a measure of its significance is added to each word.
  • The alphabetically ordered keyword list 15 only contains words of this one title description.
  • Performing statistical operations is more useful.
  • Frequently occurring words, which are less suitable as keywords, automatically acquire a lower degree of significance.
  • Table 6 shows such a list for the present example, wherein a degree of significance is also given for each keyword.
  • The term "Anisterdarn" is the most infrequent and is therefore selected as the first, most significant keyword.
  • The words of the input records, or title descriptions, are lemmatized, so that the terms "bedrijven" and "Gebroeders" are lemmatized to "bedrijf" and "broeder" respectively.
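One possible way to attach a significance value to each word is sketched below. Inverse document frequency is used here as the measure, which is an assumption: the description leaves the exact statistic open. The three title descriptions and both function names are hypothetical.

```python
import math


def keyword_significance(title_descriptions):
    """Return an alphabetically ordered list of (word, significance) pairs.

    title_descriptions : dict mapping a card id to its list of (lemmatized) words
    Significance is log(N / df): words occurring on every card score 0.
    """
    total = len(title_descriptions)
    doc_freq = {}
    for words in title_descriptions.values():
        for word in set(words):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return [(word, math.log(total / df)) for word, df in sorted(doc_freq.items())]


def select_keywords(words, significance, n):
    """Pick the n most significant (most infrequent) words of one title description."""
    scores = dict(significance)
    ranked = sorted(set(words), key=lambda w: (-scores.get(w, 0.0), w))
    return ranked[:n]


# Hypothetical title descriptions after OCR and lemmatization.
CARDS = {
    1: ["anisterdarn", "bedrijf", "de"],
    2: ["bedrijf", "de", "haven"],
    3: ["de", "broeder"],
}
SIGNIFICANCE = keyword_significance(CARDS)
print(select_keywords(CARDS[1], SIGNIFICANCE, n=2))  # ['anisterdarn', 'bedrijf']
```

The word "de", which occurs on every card, receives significance 0 and is never preferred as a keyword.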
  • Figure 9 is a block diagram showing the method for comparing the title description with the external database and linking the title description to a record of the comparison database.
  • The title descriptions in the input records are compared one by one with the information from the comparison database.
  • From the title description the n most infrequent keywords are selected 16 on the basis of the keyword list with statistical information, wherein the number of keywords n can be varied and is determined in practice.
  • The keywords are then input into the trigram comparison module 17.
  • The keyword "Anisterdarn" is chosen as the first input into trigram comparison module 17.
  • The keyword is divided into trigrams consisting of three letters (wherein the character "*" represents an arbitrary character from the alphabet).
  • The trigrams of the keywords are subsequently compared with those of the above-described trigram index 9.
  • Figure 10 indicates how the comparison is effected for the present example.
  • Four of the ten trigrams of "Anisterdarn" correspond with trigrams from trigram index 9 which have a reference to the second word of the LBIF 7.
  • The second word in LBIF 7 is in this case "Amsterdam".
  • This number is greater than the number of corresponding trigrams in the other words of LBIF 7, so that in this case it is decided that the word "Anisterdarn" in the title description to be recognized corresponds with the word "Amsterdam".
  • The decision can also be taken on other grounds.
  • A higher significance of particular trigrams relative to other trigrams can for instance be taken into account.
  • The trigram "tje" for instance occurs so frequently in Dutch that a lower significance can be assigned to it than to another, less usual trigram.
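The trigram comparison for the "Anisterdarn" example can be reproduced with a minimal sketch that counts shared trigrams without weighting. It assumes a single leading "*" treated as a literal start-of-word marker and no trailing padding, which yields the ten trigrams and four matches mentioned above; the function names and the lemma "aerts" are hypothetical.

```python
def trigrams(word, pad="*"):
    """Split a word into overlapping trigrams, with one leading pad character."""
    padded = pad + word.lower()
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def match_keyword(keyword, lemmas):
    """Return (best lemma, number of shared trigrams) for one OCR keyword."""
    # Trigram index: trigram -> lemmas in which it occurs.
    index = {}
    for lemma in lemmas:
        for tri in set(trigrams(lemma)):
            index.setdefault(tri, []).append(lemma)

    # Every keyword trigram votes for the lemmas that contain it.
    votes = {}
    for tri in set(trigrams(keyword)):
        for lemma in index.get(tri, []):
            votes[lemma] = votes.get(lemma, 0) + 1
    if not votes:
        return None, 0
    # Highest vote count wins; ties are broken alphabetically (arbitrary choice).
    best = min(votes, key=lambda lemma: (-votes[lemma], lemma))
    return best, votes[best]


LEXICON = ["aerts", "amsterdam", "bedrijf", "broeder"]
print(len(trigrams("Anisterdarn")))           # 10
print(match_keyword("Anisterdarn", LEXICON))  # ('amsterdam', 4)
```

The four shared trigrams are "ste", "ter", "erd" and "rda". Weighting frequent trigrams such as "tje" lower would replace the plain vote count by a sum of per-trigram weights.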
  • The trigram comparison 17 is repeated for all keywords.
  • The result is a list of keywords (or lemmas) from the LBIF corresponding to the keywords from the title description, wherein these keywords from the LBIF in turn contain a reference to the records of the comparison database in which they occur.
  • The keywords (lemmas) from LBIF 7 corresponding with the keywords from the title description are "matched" in the VSM module 18 with the previously constructed VSM index 11 of the comparison database.
  • The title description is provided with coordinates in a manner analogous to that specified above for the VSM index 11.
  • The degree of similarity between a record j of VSM index 11 and the title description can be determined in various ways.
  • The calculation of the similarity is repeated for all records of VSM index 11.
  • The record 19 from VSM index 11 with the greatest similarity is then selected as the record which most probably corresponds with the title description of the library card.
  • The library card is therefore linked to this record of the comparison database.
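Selecting the most similar record can be sketched with the cosine of the angle between two coordinate vectors. Cosine similarity is a common choice in vector space models and is used here as an assumption; the description states only that the similarity can be determined in various ways. The index and the title vector are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equally long coordinate lists (0.0 if either is all zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def link(title_vector, vsm_index):
    """Return the id of the record whose coordinates are most similar to the title."""
    return max(vsm_index, key=lambda record_id: cosine(title_vector, vsm_index[record_id]))


# Coordinates over the hypothetical vocabulary amsterdam, bedrijf, broeder, haven.
VSM_INDEX = {
    "I": [1.0, 1.0, 1.0, 0.0],
    "II": [1.0, 0.0, 0.0, 1.0],
}
TITLE = [1.0, 1.0, 0.0, 0.0]  # matched lemmas: amsterdam, bedrijf
print(link(TITLE, VSM_INDEX))  # I
```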
  • Figure 11 shows in a block diagram the method applied for indirect comparison with the comparison database.
  • All input records, or title descriptions, are inverted, i.e. all words of the title descriptions are placed in an alphabetical list with a reference to all input records, or title descriptions, in which these words occur, which results in a file index.
  • The significance of each word and of each term in the file index is then determined.
  • The input records are subsequently fed one by one to a keyword selection module 16.
  • The keyword selection module selects, in the manner described above, the n most relevant keywords of the input record.
  • Via a trigram comparison 17 on the basis of the trigram index 9, these keywords are compared with said library lexicon 20.
  • The keywords resulting from this comparison, which are spelled correctly owing to the trigram comparison, are subsequently used to retrieve 21 the corresponding database records 19 of the comparison database 5 via the interface 22 of the external comparison database 5.
EP98939011A 1997-08-11 1998-08-07 Method and device for retrospective conversion Ceased EP1004090A1 (de)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
NL1006757 1997-08-11
NL1006757A NL1006757C2 (nl) 1997-08-11 1997-08-11 Retrospective conversion
PCT/NL1998/000452 WO1999008226A1 (en) 1997-08-11 1998-08-07 Retrospective conversion

Publications (1)

Publication Number Publication Date
EP1004090A1 (de) 2000-05-31

Family

ID=19765480

Family Applications (1)

Application Number Title Priority Date Filing Date
EP98939011A Ceased EP1004090A1 (de) 1997-08-11 1998-08-07 Method and device for retrospective conversion

Country Status (4)

Country Link
EP (1) EP1004090A1 (de)
AU (1) AU8752598A (de)
NL (1) NL1006757C2 (de)
WO (1) WO1999008226A1 (de)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5109439A (en) * 1990-06-12 1992-04-28 Horst Froessl Mass document storage and retrieval system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO9908226A1 *

Also Published As

Publication number Publication date
NL1006757C2 (nl) 1999-02-12
WO1999008226A1 (en) 1999-02-18
AU8752598A (en) 1999-03-01

Similar Documents

Publication Publication Date Title
EP0544430B1 (de) Method and apparatus for determining the frequency of words in a document without document image decoding
US5664027A (en) Methods and apparatus for inferring orientation of lines of text
EP0544434B1 (de) Method and apparatus for processing a document image
US5491760A (en) Method and apparatus for summarizing a document without document image decoding
Manmatha et al. Word spotting: A new approach to indexing handwriting
EP0544433B1 (de) Method and apparatus for document image processing
EP0544431B1 (de) Method and apparatus for selecting linguistically significant images in a document image without decoding the image content
EP0854433B1 (de) Finding titles and photos in scanned document images
Pal et al. Machine-printed and hand-written text lines identification
Ma et al. Adaptive Hindi OCR using generalized Hausdorff image comparison
Shafait et al. Document cleanup using page frame detection
Saoji et al. Text recognition and detection from images using pytesseract
Padma et al. Identification of Telugu, Devanagari and English Scripts Using Discriminating Features
Chaudhuri et al. Extraction of type style-based meta-information from imaged documents
Niyogi et al. An integrated approach to document decomposition and structural analysis
Srinivas et al. An overview of OCR research in Indian scripts
Marinai et al. Exploring digital libraries with document image retrieval
WO1999008226A1 (en) Retrospective conversion
Kliatskine et al. A structured method for the recognition of complex historical tables
Eqbal EXTRACTION AND DETECTION OF TEXT FROM IMAGES
Boudraa DLSpot: Original and Coherent Keyword Spotting System Using DTW Classifier and LBP Texture Descriptor
Padma et al. Script identification of text words from a tri lingual document using voting technique
Yeotikar et al. Script identification of text words from multilingual Indian document
Said Automatic processing of documents and bank cheques
JPH0589279A (ja) Character recognition device

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20000202

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): DE FR NL

GRAG Despatch of communication of intention to grant

Free format text: ORIGINAL CODE: EPIDOS AGRA

17Q First examination report despatched

Effective date: 20011217

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN REFUSED

18R Application refused

Effective date: 20020607